1. Introduction

There are three major aspects to our project. The first concerns the
development of a procedural approach to the layout of VLSI circuits. The second
is the continuing investigation of the census language. The third is the
testing of VLSI circuits.

2.  Procedural  Approach  to  VLSI 

2.1.  ALI2  [LaPaugh,  Mata] 

A complete version of ALI2 is now operational. It includes a variety of
support packages, among them a library of basic cells and a switch-level
simulator that is built into ALI2. This simulator is novel in that it can
detect a number of problems in circuits, such as race conditions.

ALI2 is now being used and evaluated by a number of VLSI designers at
Princeton. It is also being used in a beginning VLSI course at Princeton. We
hope to get feedback from these users shortly on ALI2 and the procedural
approach to VLSI design.

Work is already under way on improvements to ALI2. One area of improvement
is the elimination of any need for design rule checkers. Layouts generated
by ALI2 are usually design rule correct, but this is not guaranteed by the system.
It appears possible to modify ALI2 slightly to make all generated layouts design
rule correct.

2.2. Clay [Lipton, Lucas, North, Souvaine]

Clay,  another  procedural  approach  to  VLSI  design,  is  now  operational.  We 
are  currently  using  it  in  several  design  projects.  Indeed,  a  number  of  simple 
designs  have  completed  successfully  the  full  design-fabrication  cycle.  We  have 
also  just  made  Clay  available  to  other  institutions  and  have  a  number  of  users 
outside  Princeton. 

2.3.  Layout  Algorithms  [Huang,  North,  Steiglitz] 

The layout algorithms used by ALI2 and Clay are quite prone to “thrashing”
the paging system of the VAX. For this reason a number of independent
projects  are  underway  to  improve  on  the  current  implementations.  Clay  uses  a 
hierarchical  approach.  Clay  allows  the  user  to  break  their  layout  up  into  several 
pieces  that  can  be  separately  compiled  into  layouts.  This  still  preserves  the  total 
flexibility  of  Clay  layouts.  Another  more  theoretical  approach  is  based  on  a  new 
algorithm  for  layout.  For  an  important  class  of  layout  problems,  this  algorithm 
can  guarantee  few  (relatively)  page-faults.  Work  is  now  underway  to  implement 

2.4. Referee [Lipton]

Referee is a new program for circuit comparison. It uses a new definition of
when two circuits are the same. This definition is more “forgiving” than the
usual definition based on graph isomorphism. Referee also has a guaranteed running
time that is linear in the size of the circuit. We plan to integrate it into
the ALI2/Clay systems in the future.

2.5. Applications of Clay

2.5.1.  Graphics  Engine  [Dobkin,  Field,  Souvaine] 

Progress on the design of a VLSI engine for doing graphics has concentrated
on the design of custom chips for scan conversion of lines. Using Clay, adders of
various types have been designed. These can be combined to yield complete
circuits for both Bresenham’s algorithm and Field’s algorithm for anti-aliased scan
conversion of lines, scenes, and cubic curves.

Work has begun on interfacing these circuits to other portions of our graphics
system. The goal is to have the pseudo-triangle as the basic building block.
This structure consists of the interconnection of three vertices via curves of
arbitrary degree (<4). Circuits to compute these functions are lacking in even
high-end, state-of-the-art graphics systems.

2.5.2.  Recursive  Layout  [Lucas,  Souvaine,  Steiglitz] 

Clay has been used to design a number of recursive circuits. These include
(1) comparers, (2) tally circuits, (3) various adders, and others.

The advantages of using Clay for such designs are several. First of all, once
the basic cells have been described, the entire layout is generated by a single
recursive function call. Since, in Clay, the cells remain flexible until the layout is
complete, proper interconnection among the cells is assured. Moreover, by
changing a single parameter, an 8-bit, a 16-bit, a 128-bit, or any size layout may
be generated.

Equally important, however, is the ease with which we can resize transistors
in order to improve speed. A number of layouts have used this feature and
Crystal to dramatically improve their performance: one chip was sped up from
200ns to 53ns by just such a resizing, which is trivial with Clay. We are now
working on automating this whole resizing step.

3. Census

There  are  two  main  projects  under  way  here. 

3.1. Top/Down [Lopresti, North]

This project is investigating the use of the census approach to parallel
computation as a way to speed up a large class of computations. The essential idea
is that rather than speeding up the inner loop of a computation as is usual, we
plan to take a top-down approach. Here, the problem is decomposed at a high
level into independent (or nearly independent) computations on loosely coupled
processors. We are currently investigating the classes of problems that match
this approach.

3.2. Afl [Garcia-Molina, Honeyman, Lipton]

This project is investigating a new approach to the design of a supercomputer:
we propose to interconnect a large number of memories with a very small
number of processors. Our central thesis is that a machine with a huge amount
of physical memory, in the tens of billions of bytes, can outperform other
supercomputers on many important tasks. The project has already found a novel
way to implement such a machine, which we call ESP. Work is now underway to
develop and expand our understanding of the issues involved in building such a
machine.


4.  Testing 

Work  on  VLSI  testing  is  continuing  along  two  basic  lines. 

4.1. Structured Testing [Steiglitz, Vergis]

Work  here  has  recently  found  large  classes  of  regular  layouts  that  are  easily 
testable.  These  include  many  important  classes  of  systolic  arrays. 

4.2.  Bipartite  Testing  [LaPaugh,  Lipton] 

Work  continues  on  this  approach  to  design  for  testability.  The  earlier 
methods  have  now  been  extended  to  CMOS  circuits.  Work  also  is  continuing  on 
building  test  circuits. 

In  addition,  a  new  but  related  approach  to  testing  is  now  being  developed. 
It  uses  a  special  nand  gate  that  is  similar  to  that  used  in  the  Bipartite  Method. 
However, it avoids the potential doubling of the number of gates found in the
Bipartite Method. The additional cost is that the number of test vectors is no longer
constant but in the worst case is linear in the size of the circuit. The key, however,
is that, as before, it is computationally easy to find the test vectors that guarantee
100% coverage.

Molding Clay: A Manual for the Clay Layout Language

Stephen  C.  North 

Department  of  Electrical  Engineering  and  Computer  Science 
Princeton  University 
Princeton,  New  Jersey  08540 

Bell  Laboratories 
Murray  Hill,  NJ  07974 


The  Clay  VLSI  Design  Language 

Clay is a procedural language for NMOS VLSI layout design.† A layout in Clay is created
by  writing  a  program  which  describes  the  devices  and  wires  in  the  layout,  and  where  they  are 
placed.  The  Clay  system  translates  the  algorithmic  description  into  CIF  (Caltech  Intermediate 
Format). 

There  are  several  advantages  of  a  programming  language  over  a  graphical  editor  for  VLSI 
design.  A  programming  language  provides  a  means  for  controlling  the  complexity  of  the  design 
task. For instance, a structured design language can help make large layouts manageable by
top-down decomposition, similar to the way large programs can be written. A language, as opposed
to an editor, also provides a vehicle for implementing VLSI layout algorithms, and allows the
designer to write generic, parameterized cells (such as transistors, inverters, PLAs, channel
routers, etc.) and then instantiate them many times.

† The fundamental design of Clay is independent of the fabrication technology; an extension for
CMOS is planned.

A disadvantage of our approach is that the designer cannot see his design as he is writing
the layout program, except by going through the translate-layout-plot cycle. So he must have a
mental (or physical) picture of the design he is trying to create, and then express it as statements
in the programming language. This is primarily a problem in writing low-level cells, which
contain many random objects and which often must be optimized for small area. Higher level
structures tend to be more regular and are more naturally described algorithmically.
Nevertheless, we have had satisfactory experience with designing low-level cells, and since Clay can
handle arbitrary CIF objects, it is very easy to access cells created by other layout tools such as a
graphical editor.

Clay  was  written  as  a  package  of  C  data  types  and  functions.  Before  trying  to  write  a  Clay 
program,  the  designer  should  already  know  C.  We  chose  C  as  a  base  language  because  we  did  not 
want  to  try  to  re-invent  all  the  features  of  a  structured  programming  language  not  related  to  the 
layout  task  and  C  is  flexible  enough  to  support  the  data  types  and  function  interfaces  we  need. 
Further,  the  Unix  C  compiler  is  efficient  enough  to  support  large  layouts. 

Clay adds two new data types to C: wires and symbols. Wires are horizontal or vertical
runs of some layer (metal, polysilicon, or diffusion). Wires declared in a Clay program are of fixed
width but variable length. The length is determined by the Clay system itself as part of the
translation into a layout. A wire can be thought of as a stretchable line segment with a
fixed-width field around it. A symbol is a small rigid piece of CIF, such as a transistor or contact.
Symbols interconnect wires. Thus, a layout consists entirely of stretchable wires meeting at
symbols. It is intentionally not possible to place any object at an absolute location. This flexible
placement of objects, similar to stick diagrams, is an important feature of Clay.

The Clay language primitives (which we will describe in detail later) create wires and
symbols and control their placement in the layout. The execution of a Clay program produces, not
the CIF layout, but a list of the wires and symbols it created, and constraints on their placement.
A program called the solver converts these into CIF.

To  get  started,  consider  the  following  simple  Clay  program  illustrating  the  basic  primitives 
(line  numbers  are  not  part  of  the  program). 


     1:  #include "/va/clay/lib/header.h"
     2:  main()
     3:  {
     4:      wiretype w;
     5:      symboltype s;
     6:      w = wire(POLY,MIN);
     7:      s = symbol("inpcontact");
     8:      ordered(LR);
     9:      place(s, NULL,NULL,w,NULL);
    10:      place(s, w,NULL,NULL,NULL);
    11:      leaveordered();
    12:  }


Line  (1)  is  the  include  needed  for  the  definition  of  Clay  data  types.  Every  Clay  program  must 
have  this.  Line  (4)  is  the  declaration  of  a  wire  variable.  A  wire  variable  takes  on  actual  wires  as 
values.  A  call  to  wire  creates  a  new  wire  in  the  layout,  but  does  not  say  anything  about  where 
to  place  it,  nor  how  long  it  is.  Thus,  the  call  to  wire  in  line  (6)  sets  w  to  a  new  minimum  width 
wire  of  polysilicon.  In  NMOS,  the  legal  layers  are  POLY,  METAL,  and  DIFF.  Widths  larger 
than  MIN  can  be  given  as  multiples  of  the  predefined  constant  LAMBDA,  for  instance: 


    w = wire(METAL,10 * LAMBDA);


To  conform  with  the  convention  that  CIF  dimensions  are  given  in  centimicrons  for  2.0  micron 
NMOS,  LAMBDA  is  currently  defined  as  200.  For  a  different  fabrication  process  or  CIF  scaling 
factor,  LAMBDA  can  be  redefined.  The  width  of  a  wire  is  the  maximum  of  the  user-supplied 
width  and  the  process  minimum.  That  is,  a  wire  can’t  be  narrower  than  the  design  rules  allow, 
but  it  can  be  wider. 
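
The width rule above can be sketched in a few lines of C. This is only an illustration, not Clay's actual code: the names effective_width, MIN_WIDTH (standing in for Clay's MIN), and the per-layer minimum widths in process_min are assumptions made for the example; only LAMBDA = 200 and the "maximum of the two" rule come from the text.

```c
#define LAMBDA 200      /* centimicrons per lambda, as stated in the text */
#define MIN_WIDTH 0     /* assumed stand-in for Clay's MIN: "use the process minimum" */

enum layer { POLY, METAL, DIFF };

/* Assumed process minimum widths in centimicrons; illustrative values only. */
static const int process_min[] = { 2 * LAMBDA, 3 * LAMBDA, 2 * LAMBDA };

/* The width of a wire is the maximum of the requested width and the process minimum. */
int effective_width(enum layer l, int requested)
{
    int lower_bound = process_min[l];
    return requested > lower_bound ? requested : lower_bound;
}
```

Under these assumptions, a request narrower than the process minimum is silently widened, while a request such as 10 * LAMBDA on METAL is kept as given.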


Line  (5)  is  the  declaration  of  a  symbol  variable.  As  described  before,  a  symbol  is  a  rigid 
object  that  can  be  placed  under  the  control  of  a  Clay  program.  A  symbol  variable  is  set  to  such 
an  object  by  a  call  to  symbol,  as  in  line  (7).  The  argument  to  symbol  is  the  Unix  name  of  a 
CIF  file.  Clay  uses  a  symbol  as  a  template,  to  be  copied  and  placed.  The  call  to  symbol  does 
not  put  anything  in  the  layout,  but  merely  sets  the  value  of  a  symbol  variable  so  the  symbol  can 
be  referenced  later.  Since  symbol  opens,  reads,  and  closes  the  CIF  file  to  get  the  symbol 
definition,  it  is  better  to  set  symbol  variables  once  at  the  start  of  a  program,  rather  than  within  a 
loop. 

An important concept in Clay is that wires and symbols are placed inside ordered contexts.
The ordered primitive creates a new context. Its argument specifies the kind of context to be
created: TB for top-to-bottom, BT for bottom-to-top, LR for left-to-right, and RL for
right-to-left. A context is a virtual box in the layout. A context’s scope extends until a matching
leaveordered  primitive  appears.  In  our  example,  the  left-to-right  ordered  context  created  in  line 
(8)  continues  until  line  (11).  Usually,  ordered  and  leaveordered  will  enclose  a  block  of  code, 
but  since  they  are  executable  primitives  (and  not  syntactic  delimiters  of  a  static  scope)  a  Clay 
program  can  create  new  contexts  dynamically. 

Within  a  context,  the  place  primitive  places  wires  and  copies  of  symbols.  The  general  form 
of  this  primitive  is: 

    place(sym,a,b,c,d);

The first argument is a symbol; the other four are wires or the constant NULL. The call to place
has several effects. First, it forces the wires to meet at a point: a must enter from the left, b must
enter from the top, c must enter from the right, and d must enter from below (see Fig. 1).
Second, it places a copy of the symbol on top of this point. Third, the symbol is constrained to
lie entirely within the current context. Fourth, symbols are ordered as they are placed. It is this
interplay between ordered and place that gives Clay its power. The user need never explicitly
constrain the position of any wire or symbol. The positions are implied by the sequence of
primitives that appear in an ordered context.


Figure 1

If a wire argument to place is NULL, then there is no wire entering from that direction.
Note that the four wire arguments need not be distinct: if a wire goes through a symbol, not
terminating inside it, then it can enter from both top and bottom, or left and right. The symbol
argument can also be NULL, which forces the wires to meet and orders the point in the current
context, but does not create a copy of a symbol. Note that a symbol cannot be used where the
wires do not meet at a point (see Fig. 2). Cases like this can be created by a Clay function.


Figure  2 


Once a Clay program (foo.c) has been written, it can be translated into a CIF file by the
following commands:

%  cl  foo.c 
%  a.out 
%  solve 

cl compiles the source program and loads it with the Clay runtime library. cl is a slightly
modified version of the cc compiler, with the same options. The execution of a.out creates the
constraint files. These are put in the current directory as dot files since usually the programmer
need never refer to them. Since their names are fixed (for instance: .xconstraint, .yconstraint,
.definitions) each Clay program should reside in its own directory. Finally, solver reads these files
and outputs a file named out.cif containing the layout. The plot of the example program is given
in Fig. 3.

Figure 3

Correcting  Errors 

Syntax  errors  are  detected  by  the  compiler. 

Run time errors are sometimes self-explanatory and sometimes not. If the run time
system complains about a “negative constraint,” a Clay primitive has written a constraint which
states that the right endpoint of a wire is to the left of its left endpoint. Also, CIF symbols not
in the special format described later will be rejected. The run time system can core dump for the
same reasons an ordinary C program does, such as referencing an uninitialized wire or symbol
variable. *db can be used to track down some of these errors.

The most common diagnostic from solver is the infamous “cycle error.” This means that the
Clay program wrote an inconsistent set of constraints; there is no possible layout satisfying them.
For example, a cycle error occurs if the Clay program states that wire A is both above and below
wire  B.  Look  for  incorrect  place  and  ordered  commands,  and  misuse  of  wires  that  are  function 
arguments.  Referencing  an  uninitialized  wire  variable  may  also  cause  the  solver  to  give  a  warning 
about  a  “coordinate  variable  number  out  of  bounds.” 

Many  runtime  or  solve  errors  can  be  diagnosed  with  the  aid  of  the  trace  package.  The  trace 
writes  a  log  of  the  Clay  primitives  called  (with  indentation  according  to  the  nesting  of  ordered 
contexts) on stderr. set_trace(level) turns the trace on or off. The level can be TR_NOTRACE,
TR_PARTIAL, or TR_TRACEALL. If enabled, trace also checks for dangling wires at the end of
the program run. These are wires with one end unconstrained. Since the solver tries to move the
endpoints of wires as far down and to the left as possible, the free end will stretch to the
boundary of the layout, even if it crosses over the other endpoint (it “snaps back”; see Fig. 4). The
solver gives warnings about wires that are degenerate or snap back and does not place them in
the layout. If a wire is created by a call to ext_wire, rather than wire, its external name will be
printed in the trace. The format of the call is ext_wire(layer,width,external_name) where
external_name is a string.

Finally,  the  solver  can  generate  illegal  CIF  if  there  is  a  bad  symbol  file. 


Figure  4 


More  Primitives 

Although ordered and place are powerful enough to describe most Clay designs, other
primitives  are  provided  for  access  to  internal  data  structures,  efficiency,  or  flexibility. 

drop(sym,a,b,c,d) takes the same arguments as place. drop glues up to four wires
together at a point, and puts a symbol over this point, but does not write any other
constraints. drop is appropriate when symbols are being dropped in over a regular structure which
has  already  been  constrained.  For  instance,  a  PLA  can  be  created  by  first  laying  out  a  grid  of 
wires,  and  then  dropping  in  contacts  and  transistors  where  needed  to  define  its  functions.  Because 
of  the  risk  of  design  rule  violations,  drop  should  be  used  carefully. 

override(i) changes the default separation of wires and symbols in an ordered context. The
default separation is the maximum imposed by any design rule, which is 3 * LAMBDA in the
current NMOS version of Clay. This means that Clay is not very smart about how close the
design rules allow objects to be packed; it assumes the worst case. override is intended for
hacking low-level cells, where lambdas count. Its integer argument is in centimicrons, but it can
be given as a multiple of LAMBDA. Obviously it is possible to create layouts with design rule
violations if override is used incorrectly. Note that the argument is the change in separation:
negative to decrease it, positive to increase it.
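
As a small worked example of the sign convention, the following C fragment computes the separation that would be in effect after a call to override. It is a sketch only: the helper name separation_after_override is hypothetical, and it assumes a single override applied to the default separation (the manual does not say how repeated calls combine).

```c
#define LAMBDA 200                        /* centimicrons, as in the text */
#define DEFAULT_SEPARATION (3 * LAMBDA)   /* worst-case NMOS spacing from the text */

/* Hypothetical helper: separation in effect after override(delta).
   delta is a change in separation, in centimicrons. */
int separation_after_override(int delta)
{
    return DEFAULT_SEPARATION + delta;
}
```

Under this assumption, override(-1 * LAMBDA) would pack objects 2 lambdas (400 centimicrons) apart, while override(0) leaves the default 600 centimicrons unchanged.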

layer(w), width(w), and direction(w) have wiretype variable arguments. layer returns
the layer of the wire. direction returns its direction (TB, LR, BT, or RL). width returns the
width of the wire in multiples of LAMBDA. These primitives can be helpful when writing a
function that needs to find out the type of its wire arguments.

position(w,type) constrains wires to run outside the layout (the outermost context). w is a
wiretype variable; type is one of the following: enter_left, enter_right, enter_top,
enter_bottom, thru_LR, or thru_TB. enter forces one end of the wire to be outside and
the other end inside; thru forces both ends of the wire to be outside.

freewire(w) frees the storage allocated by a call to wire. This is 28 bytes per wire in the
current version of Clay. freewire can be called when the memory requirement of a Clay
program becomes excessive due to the creation of many wires.

mark(w,string) is a symboltype-valued function. The one-line CIF symbol it returns puts
the string argument as a label on the same layer as the wire, using a Berkeley extension to
standard CIF. Placing this symbol somewhere on the wire will label it in the plot. Note that some
CIF tools (such as the Crystal timing simulator) will not recognize a label as being on a wire if it
is placed on its endpoint. Instead, the wire should pass through the symbol.

connect(a,b,c,d) forces up to four wires to meet at a point and also places the appropriate
symbol to electrically connect them. connect is usually preferable to place since it
automatically creates symbols when needed, and therefore is easier than creating them by hand and less
error-prone.


Useful Things to Know

The  functions  startup  and  endup  are  automatically  called  by  the  Clay  runtime  system  at 
the  beginning  and  end  of  its  execution.  These  primitives  should  not  be  called  by  the  user;  we 
mention  them  only  so  their  names  can  be  avoided. 

If the environment variable claypath is defined, the symbol primitive will use this to search
for symbol files. claypath should contain the name of one or more directories, separated by
colons. These directories are searched in order if the initial open of the file in the current
directory fails.
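
The search order just described can be sketched as follows. This is not the code inside the symbol primitive; open_symbol is a hypothetical helper written for illustration, and the 1024-byte path buffer is an arbitrary choice. It tries the current directory first and then each directory of a colon-separated list in order.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open a symbol file, trying the current directory first
   and then each directory in the colon-separated list `claypath` (may be NULL).
   Returns NULL if the file is found nowhere. */
FILE *open_symbol(const char *name, const char *claypath)
{
    char buf[1024];
    FILE *fp = fopen(name, "r");            /* initial open in the current directory */
    const char *p = claypath;

    while (fp == NULL && p != NULL && *p != '\0') {
        size_t n = strcspn(p, ":");         /* length of the next directory name */
        if (n > 0) {
            snprintf(buf, sizeof buf, "%.*s/%s", (int)n, p, name);
            fp = fopen(buf, "r");
        }
        p += n;
        if (*p == ':')
            p++;                            /* step over the separator */
    }
    return fp;
}
```

A caller would typically pass the environment variable directly, for example open_symbol(file, getenv("claypath")), which requires stdlib.h.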

The CIF for a symbol must be in the following canonical format. The first CIF command
must be a comment containing two numbers which give the size of the symbol (x and y) in
centimicrons. The size is measured as distance from (0,0). So the first line of a symbol of size 1000 x
1000 centered over the origin would be “(500 500);”. The next section is a list of macro
definitions. The last section is a list of macro calls and box creation commands. Note that some
CIF extensions which affect scanning the CIF file, such as the Berkeley CIF include command, are
not supported. Also, when a symbol is placed, the CIF origin (0,0) is centered over the point. At
present, symbols must be symmetric, that is, the boundaries of the symbol cannot be off-center,
although the contents of the symbol can be arbitrary.
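
To make the header convention concrete, here is a minimal C sketch that reads the leading size comment of a symbol file. The function name parse_symbol_size is an assumption for illustration, and the real runtime may parse the comment differently; the sketch only relies on the format stated above, two numbers in parentheses followed by a semicolon, each giving the distance from (0,0).

```c
#include <stdio.h>

/* Hypothetical helper: parse a size comment such as "(500 500);".
   On success stores the x and y distances from the origin (in centimicrons)
   and returns 1; returns 0 if the line is not in the expected form. */
int parse_symbol_size(const char *line, int *half_x, int *half_y)
{
    char semi;

    if (sscanf(line, " ( %d %d ) %c", half_x, half_y, &semi) != 3)
        return 0;
    return semi == ';';
}
```

Because the size is measured from the origin and symbols must be symmetric, a header of “(500 500);” describes a symbol 1000 centimicrons wide and 1000 centimicrons tall.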

A major annoyance in the current release of Clay is that there is no way to change
orientation. For instance, separate symbols for horizontal and vertical pass transistors are needed.
Likewise, if you have written a channel router in Clay with the channels running horizontally, you
cannot easily obtain from this a router with channels running vertically except by editing a copy
of the function, making the necessary changes. We intend to correct this deficiency in a future
version of Clay.

In addition to TB, LR, BT, and RL, contexts may be NONE ordered. The initial context of
a Clay program, before the first ordered call, is NONE ordered. Symbols placed in a NONE
ordered context are constrained to lie inside it, but are not constrained with respect to each other.

Solving Constraints

To write low-level primitives or modify the Clay system, you must understand how Clay
generates the CIF layout. The layout is contained entirely within the first quadrant of the
Cartesian coordinate plane. When a symbol, wire, or context is created, it is assigned coordinate
variables. Since a symbol is placed over a point, it has two coordinate variables (an x coordinate and
a y coordinate). A wire has three coordinates: a horizontal wire has two x coordinates associated
with it, and a y coordinate; similarly a vertical wire has one x coordinate and two y coordinates.
The bounding box of a context has two x coordinates and two y coordinates. The Clay primitives
can then control the positions of objects by stating constraints on the values of their coordinate
variables. For instance, let vertical wire a have x coordinate variable 4, and wire b have x
coordinate variable 12. (Coordinate variable names are non-negative integers; x variables are even, y
variables are odd.) If the Clay program states that the center line of b is at least 5 LAMBDAs to
the right of the center line of a, where LAMBDA is defined as 200, then the execution of the Clay
program creates the constraint:

    x12 - x4 >= 1000

In fact, all constraints generated by Clay are of the form:

    xi - xj >= c

where i and j are coordinate variable numbers and c is a constant distance in centimicrons.

Constraints on x coordinate variables are written in binary in the file .xconstraint. Constraints on
y variables are written in .yconstraint. Also, since endpoints of wires can be glued together, as by
drop or place, the Clay program writes a list of commands in .unionfind which force two
coordinate variable numbers to be synonyms. In addition, a list of the wires and symbols created is put
in .creation, and a list of symbol definitions is put in .definitions.

To  obtain  a  CIF  layout,  the  solver  first  reads  .unionfind  and  builds  a  union-find  tree.  Next 
on  separate  passes  it  processes  .xconstraint  and  .yconstraint  to  find  a  layout  having  smallest  total 
area,  using  a  linear-time  algorithm  based  on  topological  sort.  Finally,  solver  writes  a  CIF  file  by 
loading  the  CIF  macros  for  symbols  (using  .definitions)  and  writing  box  creation  commands  for 
wires  and  macro  calls  for  symbols  (using  .creation). 
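
The solving steps above can be illustrated with a short C sketch for one axis. It is not the solver's code: the names (struct constraint, uf_union, solve), the fixed MAXVAR limit, and the in-memory constraint array are assumptions made for the example. It merges synonym variables with union-find, then places every variable as far toward the origin as its constraints allow by processing variables in topological order, and reports a cycle when no consistent order exists. For brevity it rescans the constraint list for each variable, so unlike the solver described in the text it is not linear time; adjacency lists would fix that.

```c
#define MAXVAR 64   /* illustrative limit on coordinate variables */

/* One constraint: pos[to] - pos[from] >= dist (centimicrons). */
struct constraint { int from, to, dist; };

static int parent[MAXVAR];   /* union-find forest over variable numbers */

void uf_init(int n)
{
    for (int v = 0; v < n; v++)
        parent[v] = v;
}

int uf_find(int v)
{
    while (parent[v] != v) {
        parent[v] = parent[parent[v]];   /* path halving */
        v = parent[v];
    }
    return v;
}

/* Make a and b synonyms, as a glued wire endpoint would. */
void uf_union(int a, int b)
{
    parent[uf_find(a)] = uf_find(b);
}

/* A constraint between synonyms with dist <= 0 is always satisfied. */
static int is_trivial(const struct constraint *c)
{
    return uf_find(c->from) == uf_find(c->to) && c->dist <= 0;
}

/* Fill pos[0..nvars-1] with the smallest positions satisfying every
   constraint. Returns 0 on success, -1 on a cycle error. */
int solve(int nvars, const struct constraint *con, int ncon, int *pos)
{
    int indeg[MAXVAR] = {0}, queue[MAXVAR];
    int head = 0, tail = 0, reps = 0;

    for (int v = 0; v < nvars; v++) {
        pos[v] = 0;
        if (uf_find(v) == v)
            reps++;                      /* count distinct variables */
    }
    for (int i = 0; i < ncon; i++)
        if (!is_trivial(&con[i]))
            indeg[uf_find(con[i].to)]++;
    for (int v = 0; v < nvars; v++)
        if (uf_find(v) == v && indeg[v] == 0)
            queue[tail++] = v;           /* nothing forces these away from 0 */

    while (head < tail) {
        int u = queue[head++];
        for (int i = 0; i < ncon; i++) {
            int t;
            if (is_trivial(&con[i]) || uf_find(con[i].from) != u)
                continue;
            t = uf_find(con[i].to);
            if (pos[u] + con[i].dist > pos[t])
                pos[t] = pos[u] + con[i].dist;
            if (--indeg[t] == 0)
                queue[tail++] = t;
        }
    }
    if (tail != reps)
        return -1;                       /* some variables sit on a cycle */
    for (int v = 0; v < nvars; v++)
        pos[v] = pos[uf_find(v)];        /* synonyms share one position */
    return 0;
}
```

With the single constraint from the earlier example (variable 12 at least 1000 centimicrons beyond variable 4), this sketch places variable 4 at 0 and variable 12 at 1000. Two constraints that each require one variable to lie beyond the other make solve return -1, which corresponds to the cycle error described under Correcting Errors.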

For dynamic storage allocation in the solver, the maximum internal coordinate variable
number  and  symbol  numbers  referenced  by  the  Clay  program  are  written  in  .maxpofile,  along 
with  the  coordinate  variable  numbers  of  the  outermost  context.  These  coordinate  numbers  are 
needed  for  hierarchical  solving,  described  in  the  next  section. 

Hierarchical  Solving 

In  the  Clay  examples  given  so  far,  an  entire  layout  was  described  by  a  single  Clay  program, 
and  all  the  constraints  were  solved  in  one  run  of  the  solver.  If  a  Clay  program  creating  a  large 
layout  generates  many  objects  and  constraints,  the  run  time  of  the  solver  may  become  excessive 
and  its  memory  requirements  may  cause  page  thrashing.  To  help  avoid  this,  and  for  top-down 
refinement of Clay designs, we allow hierarchical partitioning of Clay layouts into cells, or
non-overlapping sections of a layout. A hierarchical layout has a main cell, the parent, containing one
or more child cells. Each cell is described by a separate Clay program. This containment is
recursive, so a child cell may itself have children. A parent and child cell usually have wires they
share that cross the boundary between them, called parameter wires. The parameter wires and
the outermost context of a child cell are its externally visible points.

Since  each  Clay  program  must  reside  in  its  own  directory,  we  need  a  separate  directory  for 
each  cell.  The  logical  hierarchy  of  cells  must  be  reflected  in  their  directory  names.  For  instance, 
if alu and control are children of mychip, there is a directory mychip with subdirectories alu and
control. 

We also allow rigid CIF cells to be children. A rigid cell cannot have its own children.

In a hierarchical layout, the parent and child cells are solved separately. A child may affect
the layout of its parent, since it has area and imposes a minimum distance between its parameter
wires. Likewise, a parent may affect its child by stretching the distance between parameter wires.
To obtain a hierarchical layout, we first compile and execute the Clay programs for all the cells to
get constraint files. Then, starting with the lowest-level children (those with no children of their
own), we solve to get a layout of the child cell, and append constraints on its size and the position
of its parameter wires to the constraint files of its parent. Then we solve the parents of these cells,
and so on upward in the hierarchy, until we have solved all the way up to the topmost cell (the root),
which has no parent of its own. Then we can solve back down the hierarchy, exporting constraints from
parents to their children, and at the same time getting out.cif files for the individual cells. When
we have solved all the leaf cells on this downward pass, the concatenation of all the out.cif files
yields the complete layout. The cifcat command concatenates CIF files with macro renumbering
and handles the CIF End command so the resulting file is palatable to most CIF tools. The
arguments to cifcat are names of files to be concatenated, and it writes to its standard output
(which can be redirected).

The solver works on only one cell in the hierarchy at a time. That is, in the directory of
any cell, we can run solve -u to solve up, exporting constraints to the parent, solve -d to export
constraints to children and get an out.cif file, or a simple solve to get out.cif without affecting
children. Since the solver must be invoked more than once on a hierarchical layout, you may
want to write a shell script to make this more convenient.
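
The order of solver invocations described above amounts to a post-order walk of the cell tree (solve up) followed by a pre-order walk (solve down). The C fragment below is only a sketch of that ordering: the cell and step structures, the function names, and the fixed array sizes are assumptions made for illustration and are not part of Clay. It also assumes that the root needs only the downward solve, since it has no parent to export to.

```c
#include <stddef.h>

#define MAX_CHILDREN 8
#define MAX_STEPS 64

/* Hypothetical description of one flexible cell (not a Clay type). */
struct cell {
    const char *dir;                 /* directory holding the cell */
    size_t nchildren;
    struct cell *child[MAX_CHILDREN];
};

/* One solver invocation: run "solve <flag>" in directory dir. */
struct step {
    const char *dir;
    const char *flag;                /* "-u" (solve up) or "-d" (solve down) */
};

/* Upward pass: children before parents; the root exports nothing. */
static size_t plan_up(const struct cell *c, int is_root,
                      struct step *out, size_t n)
{
    size_t i;

    for (i = 0; i < c->nchildren; i++)
        n = plan_up(c->child[i], 0, out, n);
    if (!is_root && n < MAX_STEPS) {
        out[n].dir = c->dir;
        out[n].flag = "-u";
        n++;
    }
    return n;
}

/* Downward pass: parents before children. */
static size_t plan_down(const struct cell *c, struct step *out, size_t n)
{
    size_t i;

    if (n < MAX_STEPS) {
        out[n].dir = c->dir;
        out[n].flag = "-d";
        n++;
    }
    for (i = 0; i < c->nchildren; i++)
        n = plan_down(c->child[i], out, n);
    return n;
}

/* Fill out[] (capacity MAX_STEPS); return the number of steps planned. */
size_t plan_solve(const struct cell *root, struct step *out)
{
    size_t n = plan_up(root, 1, out, 0);

    return plan_down(root, out, n);
}
```

For the mychip example with children alu and control, this plans solve -u in alu and control, then solve -d in mychip, alu, and control. A shell script could run the same sequence directly.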

Next we will explain how to define parameter wires in a Clay program and describe the
hierarchy of parent and child cells. Parameter wires and child cells are identified by name. The
primitive for creating parameter wires is ext_wire(layer,width,name), described previously. The
external name of a wire is returned by the name(w) primitive. If w was created by wire, not
ext_wire, then name returns NULL. Each wire created by a call to ext_wire has an entry in
.symtab with its coordinate numbers.

ext_ordered(direction,name) creates a context for a child cell. The context is an externally
visible object with an entry in .symtab. Parameter wires can be placed between
ext_ordered and leaveordered. The parameter wires between a cell and its parent should be
constrained by calls to position.

A floorplan in a parent cell directory tells the solver the names of children and the names of
the parameter wires. The floorplan has an entry for each child cell. The first line of each entry is
of the form:

    type childname directory

Type is either flexible or rigid. Flexible cells are those described by the constraint files of a Clay
program execution; rigid cells are in CIF. childname is the name given in the ext_ordered call.
directory is the name of the subdirectory containing the child. For sanity's sake, this should
usually be the same as childname. Remember that there must be a separate directory for each child,
even  if  they  are  identical  copies  of  the  same  layout.  For  instance,  if  your  layout  has  8  input  pads, 
there  must  be  a  separate  directory  for  each  instance  of  an  input  pad. 

Following this comes a list of the child cell's parameter wires, which we call a walk. The
walk must totally order all the externally visible wire coordinates. The x-coordinate walk comes
first (implicitly beginning with the left side of the cell and ending with the right), then the
y-coordinate walk (which likewise implicitly begins with the bottom and ends with the top of the
cell). The walk is given simply by listing the wire coordinates, terminated by a *. A wire
coordinate is one of the following:

    x(wirename)    x-coordinate of a vertical wire
    x1(wirename)   left x-coordinate of a horizontal wire
    x2(wirename)   right x-coordinate of a horizontal wire
    y(wirename)    y-coordinate of a horizontal wire
    y1(wirename)   lower y-coordinate of a vertical wire
    y2(wirename)   upper y-coordinate of a vertical wire

where wirename is the name given to the wire in the call to ext_wire. For instance, the
floorplan entry of the flexible cell in Fig. 5 is shown below:

    flexible onebitadder onebitadder
    x2(gnd) x2(cin) x(sum) x(data1) x(data2) x1(cout) x1(vdd)
    *
    y(gnd) y2(data1) y2(data2) y(cin) y(cout) y1(sum) y(vdd)
    *

Since the floorplan walk imposes a total ordering on the parameter wires, even if they are not
otherwise related to each other in the Clay program, you may need to fine-tune a floorplan entry
if the ordering causes the cell to stretch unnecessarily. For instance, in the floorplan entry for
onebitadder, the y-walk forces cout to be at the same y-value as cin or above it, even though the
Clay program may allow them to float. If the cell stretches badly because of this, y(cout) should
appear before y(cin) in the floorplan.

Figure 5

The walk for a rigid CIF cell is slightly different, since we also need to specify the exact
separation of parameter wires measured between wire centers, and the wire types and widths.
These are given in centimicrons (not LAMBDAs). The walk begins and ends with the separation
from the cell boundary. The entry for each wire is the wire coordinate, its type ('m', 'p', or 'd'),
and its width. The separation appears between the wire entries. The floorplan entry for the rigid
cell of Fig. 6 is given below:

    rigid datacell datacell
    0 x2(v) m 1600 0 x2(g) m 1600 20000 x(in) p 400 10000 x1(out) d 400 0
    0 y2(in) p 400 800 y(v) m 1600 6000 y(out) d 400 20000 y(g) m 1600 800
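
As a quick check on such an entry, the separations in a walk can be summed. Under the reading that the first and last separations are measured from the cell boundaries, the total is the extent of the rigid cell along that axis. The C fragment below is only a sketch under that assumption. The names walk_extent and separations_kept are hypothetical helpers, not Clay primitives, and the second function merely illustrates the kind of test that could be applied to solved wire coordinates to confirm that a rigid cell has not been stretched.

```c
#include <stddef.h>

/*
 * sep[] holds the nwires + 1 separations of one walk, in centimicrons:
 * boundary to first wire, wire to wire, last wire to boundary.
 * Returns their sum, taken here as the extent of the cell on that axis.
 */
long walk_extent(const long *sep, size_t nwires)
{
    long total = 0;
    size_t i;

    for (i = 0; i <= nwires; i++)
        total += sep[i];
    return total;
}

/*
 * pos[] holds solved coordinates of the nwires wires; lo and hi are
 * the solved cell boundaries.  Returns 1 if every separation of the
 * walk is preserved exactly, 0 otherwise.
 */
int separations_kept(const long *sep, const long *pos, size_t nwires,
                     long lo, long hi)
{
    long prev = lo;
    size_t i;

    for (i = 0; i < nwires; i++) {
        if (pos[i] - prev != sep[i])
            return 0;
        prev = pos[i];
    }
    return hi - prev == sep[nwires];
}
```

For the x-walk of datacell, the separations 0, 0, 20000, 10000, 0 sum to 30000 centimicrons; the y-walk separations 0, 800, 6000, 20000, 800 sum to 27600.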

The CIF for a rigid child cell must be placed in the file in.cif in its directory. During solve up,
the constraints implied by the rigid cell's walk are exported to its parent. Then, during solve
down, rather than exporting constraints, the parent checks that it has not tried to change the
separation of the rigid cell's parameter wires, and writes out.cif in the child directory by
translating in.cif to its position in the layout. in.cif must be in the canonical format described
earlier for symbols.

There are two other Clay primitives written for separate compilation.
arrayname(name,number) simply concatenates the string conversion of number to name, for
convenience in giving names to arrays of parameter wires. For instance, arrayname("data",5)
is "data5". put_in_cif(cellname,ciffile,floorplan_header) makes it easier to incorporate rigid cells
in Clay programs. Its arguments are string pointers. The first is the name of an ext_ordered
context. The second is the name of a CIF file, and the third is the name of a file containing the
floorplan header (X and Y walks). put_in_cif creates a subdirectory for the child cell (if needed),
copies the named CIF file to in.cif, and appends an entry for the rigid cell to the floorplan in the
current directory. Since the floorplan is modified every time put_in_cif is called, you will need
to make a backup of floorplan and restore it whenever the Clay program is run. Otherwise,
put_in_cif will append multiple copies of the same entry to floorplan.
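
The behavior described for the name-building primitive is easy to picture with a small stand-in. The function below, arrayname_sketch, is a hypothetical reimplementation written for illustration only. How the real Clay primitive allocates or returns its string is not stated in the text, so the heap allocation here is an assumption.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a newly allocated string made of name followed by the
 * decimal form of number, or NULL if allocation fails.
 * The caller frees the result.
 */
char *arrayname_sketch(const char *name, int number)
{
    /* 3 digits per byte overestimates any int; +2 for sign and NUL */
    size_t len = strlen(name) + sizeof(int) * 3 + 2;
    char *s = malloc(len);

    if (s != NULL)
        snprintf(s, len, "%s%d", name, number);
    return s;
}
```

Called in a loop over i, such a helper yields names like data0, data1, and so on, one per parameter wire in an array.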

A  Simple  Router  In  Clay 

As another Clay example, consider the following function, which is a one-sided channel
router using a greedy allocation strategy. Metal wires enter from the top; the router connects the
nets on poly. The function call is gcroute(n,a,w) where n is the number of wires, a is the
connection list given as an array of n integers, and w is an array of n wires, already separated left
to right. a[i] gives the index in w of the next wire to the right of w[i] to be connected in the net, or
-1 if w[i] is the rightmost wire in the net.

The router has two phases. In the first phase it assigns channels to the nets. To do this, it
works from left to right in the net list, assigning the lowest-numbered channel available. The
variable pos stores the current position in the left-to-right scan for wire nets.
channel_number[pos] stores the channel number (0 is topmost) chosen for the connection of the
net whose leftmost terminal is at pos. chan[i] stores the index of the rightmost wire in the
current net connected on channel i. When pos > chan[i], the ith channel can be reused. No
actual layout is done during the first phase.
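
The first phase can be tried out on its own, without any Clay primitives. The following C function is a self-contained sketch of the greedy assignment just described, not the router's actual source: the name assign_channels, the caller-supplied scratch array, and the convention of returning -1 on error (instead of printing a message and exiting) are assumptions made for this illustration.

```c
#define NCHANNELS 10   /* max number of channels, as in the router */

/*
 * Greedy channel assignment (phase one only, no layout).
 * a[i] is the next wire to the right of wire i in its net, or -1.
 * On success, channel_number[i] holds the channel of the net whose
 * leftmost wire is i (-1 if i is not leftmost) and the number of
 * channels used is returned.  Returns -1 on a malformed net list or
 * when NCHANNELS channels are not enough.
 * seen[] is scratch space for n ints supplied by the caller.
 */
int assign_channels(int n, const int *a, int *channel_number, int *seen)
{
    int chan[NCHANNELS];   /* rightmost wire occupying each channel */
    int used = 0, pos, i, c;

    for (i = 0; i < n; i++) {
        channel_number[i] = -1;
        seen[i] = 0;
    }
    for (c = 0; c < NCHANNELS; c++)
        chan[c] = -1;

    for (pos = 0; pos < n; pos++) {
        if (seen[pos])
            continue;                      /* already part of a net */
        if (a[pos] <= pos || a[pos] >= n)
            return -1;                     /* nets must run left to right */
        /* lowest-numbered channel whose last net ended left of pos */
        for (c = 0; c < NCHANNELS && chan[c] >= pos; c++)
            ;
        if (c == NCHANNELS)
            return -1;
        channel_number[pos] = c;
        if (c + 1 > used)
            used = c + 1;
        /* follow the net to its rightmost wire */
        for (i = pos; ; i = a[i]) {
            seen[i] = 1;
            if (a[i] == -1)
                break;
            if (a[i] <= i || a[i] >= n)
                return -1;
        }
        chan[c] = i;                       /* channel busy up to wire i */
    }
    return used;
}
```

With six wires and the nets {0,3}, {1,2}, and {4,5}, the first net takes channel 0, the second must use channel 1 because wire 1 lies inside the span of the first net, and the third reuses channel 0 because pos = 4 exceeds chan[0] = 3.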

During the second phase, the router works from top to bottom, laying out each channel.
Since the wires in w are assumed to be previously separated from each other, each channel is
ordered(NONE). The router then looks through the channel_number array to find wires which
are the leftmost members of nets running on the channel, and connects the net via a poly wire.
All dynamically allocated data structures are freed before the function exits.



/*
 * greedy one-sided channel router
 * metal wires enter from top and nets are connected by horizontal
 * poly runs.
 * n is the number of wires
 * a is the connection list. a[i] gives the index of the next wire
 * to the right of w[i] in the net or -1 if it is the rightmost.
 * w is an array of wires to be connected, they must already be
 * constrained in left to right order!
 */
#include "/vb/clay/lib/header.h"
#include <stdio.h>
#define NCHANNELS 10    /*max number of channels route can have*/

gcroute(n,a,w)
int n,a[];
wiretype w[];
{
    int chan[NCHANNELS];    /*rightmost terminal connected*/
    int nextavail = 0;      /*next available channel (lowest numbered)*/
    int i,j,k,pos,prev;
    int *phase,*channel_number;
    int maxused = -1;       /*highest channel number actually used*/
    wiretype c;

    /*phase keeps track of which wires have been connected*/
    phase = (int *) malloc(n * sizeof(int));
    for (i = 0; i < n; i++) phase[i] = 0;    /*mark everyone as not seen yet*/
    /*channel_number remembers to which channel the leftmost wire
      in a net list has been connected*/
    channel_number = (int *) malloc(n * sizeof(int));
    for (i = 0; i < n; i++) channel_number[i] = -1;    /*mark as not used yet*/
    for (i = 0; i < NCHANNELS; i++) chan[i] = -1;      /*not used yet*/

    /*first phase is to compute connections to channels*/
    pos = 0;
    while (pos < n)
    {
        if (a[pos] <= pos)    /*can't go from right to left in the net list*/
        {
            fprintf(stderr,"route: attempt to connect term %d to %d\n",pos,
                a[pos]);
            exit(-1);
        }
        channel_number[pos] = nextavail;
        if (nextavail > maxused) maxused = nextavail;    /*remember max*/
        i = a[pos];
        phase[pos] = 1;
        while (a[i] != -1)    /*scan middle contacts*/
        {
            phase[i] = 1;
            i = a[i];
        }
        phase[i] = 1;         /*rightmost contact*/
        chan[nextavail] = i;  /*mark how far we used*/
        /*move to next position and find nextavail channel*/
        while (pos < n && phase[pos]) pos++;
        if (pos == n) break;  /*all done*/
        prev = nextavail;
        for (nextavail = 0; nextavail < NCHANNELS; nextavail++)
            if (chan[nextavail] < pos) break;
        if (nextavail == NCHANNELS)
        {
            fprintf(stderr,"couldn't route in %d channels\n",NCHANNELS);
            exit(-1);
        }
    }

    /*second phase is to create layout*/
    ordered(TB);    /*go by channels*/
    for (i = 0; i <= maxused; i++)
    {
        ordered(NONE);    /*use ordering of w[] within channels*/
        /*could speed up by having a list per channel; not worth the trouble*/
        for (j = 0; j < n; j++)
        {
            if (channel_number[j] != i) continue;    /*ignore if not leftmost*/
            /*do leftmost terminal*/
            c = wire(POLY,MIN);    /*poly wire for channel*/
            connect(NULL,w[j],c,NULL);
            k = a[j];
            while (a[k] != -1)     /*do middle terminals*/
            {
                connect(c,w[k],c,NULL);
                k = a[k];
            }
            /*do rightmost terminal*/
            connect(c,w[k],NULL,NULL);
            freewire(c);
        }
        leaveordered();
    }
    leaveordered();
    free(phase);
    free(channel_number);
}


A plot of a layout created by this function is given in Fig. 7.


Figure 7

1. Introduction

In this paper we describe the main features and usage of a language designed at Princeton
to automate the layout of VLSI circuits. The language is called ALI2 and has been operational
for some months at Princeton. The language ALII, also developed at Princeton, was a forerunner
to ALI2.

The  main  thesis  in  the  ALI  project  is  that  VLSI  design  can  be  profitably  thought  of  as  a 
programming  task,  as  opposed  to  a  geometric  editing  task.  We  believe  that  making  layout  design 
similar  to  software  design  has  many  advantages  and  that  much  is  to  be  gained  by  consciously 
attempting  to  apply  our  knowledge  about  programming  to  this  new  activity.  We  have  thus  tried 
to  create  tools  for  the  VLSI  designer  that  incorporate  many  useful  features  of  the  software 
development  tools  that  we  are  familiar  with. 

The  main  feature  of  ALI2  as  a  layout  language  is  that  it  allows  its  user  to  design  layouts  at 
a  conceptual  level,  in  which  only  the  topological  relations  between  the  layout  components  can  be 
specified.  Absolute  positions  of  layout  components  cannot  be  specified. 

2. An overview of ALI2

ALI2 programs are compiled by first translating the ALI2 statements into standard Pascal.
Partly as a consequence of this arrangement and partly for aesthetic reasons, ALI2 programs look
very much like Pascal programs.

The objects manipulated by ALI2 programs can be classified naturally into two categories:
those that a normal Pascal program can manipulate (which will be called Pascal objects) and
those that are specific to ALI2 (ALI2 objects). There are three ALI2 objects: cells, boxes, and
wires. ALI2 programs can also manipulate aggregates of wires, just as Pascal programs can
manipulate aggregates of variables using structured types. Although ALI2 programs will typically
manipulate all three kinds of ALI2 objects, the final product of an ALI2 program is a layout
consisting entirely of wires. Cells and boxes are simply used as ways to express the relations between
groups of wires in a structured and systematic way.

A cell in ALI2 is a prototype for a rectangular section of a layout. In a cell definition, the
user describes a prototype of a rectangular layout piece. In a cell creation, also called
instantiation, the user requests the insertion of an instance of a previously defined cell in a given
environment. Multiple instances of a prototype can be created. It is possible to define a cell prototype
whose content and structure depend on the values of parameters which will be supplied to the
prototype at run-time. The sizes and shapes of actual instances of a given cell will then vary
according to the "actual parameters" provided when the instance is created. Thus, ALI2 cells are
very much like the familiar parameterized procedures and functions.

Each cell instance is enclosed in a cell bounding box; cells are thus restricted to have
rectangular shape. Cell boundaries may not overlap, nor may they be crossed by any wires. Wires
will either be entirely contained within a given cell instance, or lie entirely outside it. Cell
boundaries therefore impose a strict hierarchy on the arrangement of wires in a layout.

Wires are rectilinear objects which lie on a specific layer, have a given width, and carry a
specified signal. Wires are used to interconnect cells and must have both of their endpoints lying
on cell boundaries.


Fig. 1 - Four separate cells and the result of connecting them

The entire layout generated by an ALI2 program is itself actually an instance of a single cell
defined by the program. An ALI2 program produces a set of linear inequalities involving the
coordinates of the endpoints of the wires and boxes in the layout as variables. These inequalities,
which embody the relations between the wires and boxes of the layout, are then solved to
generate the positions and sizes of the layout elements. The program also produces connectivity
information about the wires in the layout. This information can then be used by a switch level
simulator that predicts the behavior of the circuit as laid out without having to perform the usual
"node extraction" analysis on the resulting layout.
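
To give a feel for how a set of inequalities can be turned into coordinates, the following C sketch solves the simplest common case, in which every inequality has the form x[to] - x[from] >= gap. Both that restricted form and the relaxation method used here are assumptions chosen for illustration. The text does not say what form ALI2's inequalities take beyond being linear, nor how its solver works, and the names constraint and solve_min are hypothetical.

```c
#include <stddef.h>

/* One inequality of the assumed form x[to] - x[from] >= gap. */
struct constraint {
    int from, to;
    long gap;
};

/*
 * Find the smallest coordinates x[0..nvars-1] >= 0 satisfying every
 * constraint, by repeated relaxation (longest-path style).
 * Returns 0 on success, or -1 if the constraints contradict each
 * other (a cycle whose gaps sum to more than zero).
 */
int solve_min(int nvars, const struct constraint *c, size_t nc, long *x)
{
    int pass, changed = 1;
    size_t k;

    for (pass = 0; pass < nvars; pass++)
        x[pass] = 0;
    /* without a contradiction, values settle within nvars passes */
    for (pass = 0; pass <= nvars && changed; pass++) {
        changed = 0;
        for (k = 0; k < nc; k++) {
            long need = x[c[k].from] + c[k].gap;

            if (x[c[k].to] < need) {
                x[c[k].to] = need;
                changed = 1;
            }
        }
    }
    return changed ? -1 : 0;
}
```

For three coordinates with x1 - x0 >= 3, x2 - x1 >= 4, and x2 - x0 >= 5, the sketch places them at 0, 3, and 7: the third inequality is already implied by the first two, so it does not stretch the layout.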


Fig. 2 - Layout produced by an ALI2 program


-chip  (output); 

wl  retype  polywire  -  wire  (poly,  ?*1aebda,  ftu  1 1  •  1gn«  1 1  s 
dlffwtr*  -  wire  (dlff,  ZMeebds,  nulletgnalli 
•#talwir«»  w)r*  < acta  1 . 4*  lasibda  ,  nu11stgna1)| 
f  U*wir«t  (In  layer)  -  bus  wl  :  polywire; 

w2i  MUlwIrs; 

w3r  wir*  <lr,  ■ImUtkMr),  su11s(«niUt 
w4  :  MUtwtrs; 
v5:  polywlre; 

end ; 

vlrsvar  11,  rri  f 1 vtvfrsa  (poly): 

coll  contact  (left  1i  u(rs:  top  t:  arlt#;  right  ri  wire;  bcttos  bi  wire): 
begin 

cr«at«  ayscontact  (  111,  Itl,  Irl,  Ibl  >  (false) 

•nd : 

C*1 1  Invarttr  (  l#ft  1:  flvevlres;  right  n  ftvewfres  )| 
wlrovar  dirfl,  dlffT.  d1ff3:  dlffwlre; 
beg  i n 

ordartd  ttob  do  begin 

crtit*  contact  (  l.wl.  nullwlre,  r.wl,  nullwlre  )i 
cr*»t*  contact  i  l.wZ,  nul Wire,  r. w2,  dlffl  ); 
create  ayapullup  (  nullllst,  Idlffll.  Ir.w3l,  Idlfffcl  1  (4); 
create  syst r an« * stor  (  M.v3l.  Idiff2l,  nullilat,  Idfff3l  >  ffal**): 
create  contact  (  l.w4.  oiff3.  r.wi,  nul  Wire  ): 

create  contact  (  l.w5,  nuHwIre.  r.w5,  nullwlre  I; 

end  iorderec'J  : 

•nd : 

cell  ck  1  (  le*t  1:  f  We^lres;  r  ight  r:  f  leawlrai  ); 

wlrtvi'  po'yl .  pe’y?:  pclyvfre: 

C  1  f  f  i  :  d  I  f  fwi  **e  ; 

•etl:  aet*  Wire; 

begin 

ordered  ttofc  do  begin 

create  contact  l.wl,  nul Wire,  r.wl,  pelyl  ); 

cr**t*  fyscontact  <  I  1 . v2 I ,  Ipclyli,  Ir.wZl,  IpolyZl  )  (true); 

orcered  ito-  do  beoln 

create  rctatedlTf  syst  -  ms  ( st  r>-  {  IpolyZt,  Idlffll,  nullilat.  Il.w3l  )  (false): 
c-eate  cor  +  ert  (  d*f':.  m-1  W-;re,  nullwlre,  eetl  ); 
end  < o^cer  c  c  . 

create  contact  1  nullwlre.  met  1  .  r.w?.  nullwlre  ); 
c  r  c  3 1  c  ccrtitt  1  .  w  t  .  m  '  W'*e,  r>(,  nullwlre  ); 

treeto  contact  1  l.w£.  nu*  Wire .  r.vt,  nullwlre  >; 

enc  (orcc*edl 
end  : 

ce’ ‘  cl 2  i  'eft  1:  fivewirer;  right  r;  f 1vewirti  ); 

W  I  r  e  v  *  '  p  :  *  y  1  .  fcW:-  pclyw'r*  : 

C  :  :  d  Kf/  re. 
met !  .  rtti  Wi  -e  : 

be:'.r 

o*  d*~ed  ttet  dc  bre'r 

create  c'-tict  '"l.wl.  nul  W'rt,  r  ,v’.  .  nul  Wire  1; 

ere*,  t  e  cir  :  :  t  (  1  .  w?  ,  n,l  W're.  r  .  *  2  ,  nul  Wire  >  ; 

c'cc'e*  ltcr  c:  te^-r 

c*e:  .€  rctatedTr  syjt’ir; i jto'  <  lpo1y2l,  ll.wji.  nullilat,  Idlffll  )  (f*la*>: 
crc.*  cc>  u  (  dlffl.  nJK‘re.  nullwlre,  a*etl  ); 
enc  :  o*  ce-  e  :  ‘ 

create  contact  *  nu1 Wi-e.  eetl.  r.w*,  nullwlre  >: 

cheats  jvSMrti:*,  <  M.v/  IpclyTl.  |r.w4l,  i  poly  II  )  (true); 

L^a?  ci-r.taci  <  l.w5.  pelyl.  r.wl,  nullwlre  ); 

e-i  bordered* 
enc  . 


cell  Shift  (  le't  1*:  fivewirer:  r«ght  rrt  ffvewlres  ); 
w-revtr  si-;  ,rr*C  f  ivew>e»  (diff): 
m-:  :  f  1  >  e  •'  i  r*s  (  pc  1  y  *  ; 

bee  1  r> 

c-este  irue-te-  (  11.  siml  ); 

e*r,T7  d!  '  r-1  .  ir~?  )  ; 

C r  e:  t e  wtc  <  im^,  air 3  ); 

c*  e*te  ol  '  mrr  3 .  rri; 

*rd  : 

cell shiftregister ( left inbus: fivewires; right outbus: fivewires ) (length: integer);
wirevar temp: fivewires (poly);
begin
    if length = 1 then
        create shift ( inbus, outbus )
    else begin
        create shift ( inbus, temp );
        create shiftregister ( temp, outbus ) ( length - 1 )
    end {if}
end;

begin
    create shiftregister ( ll, rr ) ( 3 )
end.


Fig. 3 - An ALI2 program

3. Main Features of ALI2
3.1.  Type  Structure 

The wires manipulated by ALI2 are declared by stating their name and their type. Wires
can be of a simple type (a single wire) or of a structured type (a group of wires).

ALI2  is  a  strongly  typed  language.  The  ALI2  compiler  will  perform  type  checking  just  as 
compilers  for  conventional  languages  do.  Type  checking  can  be  effective  in  catching  certain  errors 
very  early  during  the  design  phase.  For  example,  cells  can  be  designed  to  accept  only  certain 
types  of  wires,  and  any  violation  will  be  reported  during  compilation  time  even  before  the  layout 
is  actually  produced. 

Wire types in ALI2 are parametric types. Parametric types are designed to make type
checking more selective or weaker as the user wishes.

In  ALI2  there  is  just  one  predefined  wire  type  called  wire.  This  parametric  type  has  three 
parameters  corresponding  to  the  three  attributes  of  a  wire: 

wire  (  l:  wirelayer;  w:  integer;  s:  signal ) 

The  types  wirelayer  and  signal  are  predefined  scalar  types.  The  parameter  w  stands  for  the  width 
of  the  wire. 

Other parametric types can be defined by pseudo-calls to the type wire. For instance, the
following  type  definition 

polywire  (  w:  integer  )  =  wire  (  poly,  w,  nullsignal ) 

creates  a  new  parametric  type  polywire.  All  wires  of  this  new  type  will  have  poly  as  their  layer 
and  nullsignal  as  their  signal.  The  following  wirevar  declaration 

mywire: polywire ( 2*lambda )
creates  a  poly  wire  with  width  2*lambda. 

The values used as actual parameters can be arbitrary expressions of the appropriate type.
These expressions will be evaluated at run time. Thus if k is a variable of type integer defined in
the current scope, the following would have been a legal type declaration:

localpoly = polywire ( (2*k - 1)*lambda )

Thus  the  actual  parameters  of  the  parametric  types  of  ALI2  are  bound  at  run  time.  This  allows 
for  a  great  deal  of  flexibility  and  permits  the  construction  of  dynamic  types  within  a  cell. 

There are three composite wire types in ALI2: bus, bundle and list. The types bus and
bundle are roughly analogous to the array and record types of Pascal, and represent, respectively,
aggregates of wires of the same type and aggregates of wires of different types. The type list is
peculiar to ALI2. A list is either the nulllist or an aggregate of one or more wires, each of any
type whatsoever. This type is intended to facilitate the writing of general-purpose cells which
accept a variable number of wire parameters.

The accessing of the elements of bundles and buses is done as in Pascal. Accessing of lists is
similar to that of bundles. ALI2 also provides the user with a number of predefined functions
that take composite or simple wires as parameters and return various interesting attributes of the
wires like layer, width, number of elements, etc.

3.2. Cell Mechanism

Perhaps the most powerful feature of ALI2 is its procedure-like mechanism for the definition
and creation of cells. The cell mechanism permits the users of ALI2 to introduce hierarchical
information into their programs, and therefore into the layouts they describe.

A  cell  is  a  collection  of  related  wires  enclosed  in  a  rectangular  area.  Wires  that  are  inside  a 
cell  are  of  two  types:  local  which  are  invisible  to  the  outside,  or  parameters  which  can  interact  in 
a  simple  and  well  defined  manner  with  wires  outside  the  cell. 

A cell is defined by specifying its local objects, its formal parameters and the relations
among all of them. Once a cell has been defined, it can be instantiated as many times as desired
by specifying the actual parameters for the instance, much the same way as one invokes a
procedure or function in a procedural language. The result of instantiating a cell is to create a brand
new copy of the prototype described in the cell definition with the formal parameters connected
to the actual parameters.

The body of a cell will contain Pascal and ALI2 statements. Cells can be defined to be
'external' cells and separately compiled. Cells can also be 'rigid' cells to indicate that the cell
definition is not given textually as part of the ALI2 program but instead the actual layout
produced by a previous instantiation of the cell is to be used.

Cells  are  instantiated  by  the  create  statement,  and  the  parameter  list  of  the  cell  contains 
both  wire  parameters  and  other  parameters. 

The  cell  mechanism  helps  in  the  automatic  generation  of  constraints  in  many  ways:  local 
wires  and  cells  are  put  inside  the  cell  bounding  box,  wire  parameters  are  separated,  and  cells  that 
share  a  parameter  are  automatically  separated. 

The cell mechanism gives the ALI2 user the ability to describe layouts in a truly
hierarchical manner. A proper ALI2 design, very much like a well structured program, will consist of a
hierarchy of cell instances with only a small amount of information at a given level (the
parameters of the cell instances at that level) being visible from the immediately higher level. Cells can
be written and debugged separately and then put together with the least effort to obtain more
complicated cells.

Much of the power and generality of the cell mechanism of ALI2 comes from the absence of
absolute positions and sizes in a layout specification. We believe that no cell mechanism can be
said to be truly general unless the sizes of its parameter wires and local wires, as well as the
relative distances between them, are determined at the time the cell is instantiated.

The primitive cells in ALI2 are the predefined cells. These are the cells that appear at the
leaves of the hierarchy of cells. In fact, the whole layout can be viewed as a collection of primitive
cells joined together by straight line wires. The higher level cells are just rectangular regions
enclosing subsets of these primitive cells.

The primitive cells in ALI2 are called systransistor, syscontact and syspullup. These are
quite general cells that implement the transistor, contact, and pullup of nMOS. Each of these
primitive cells has four parameters: four lists of wires, one for each side of the cell. The
contents of an instance of a primitive cell will depend on the attributes of the actual parameter wires
used in that instance. So, these cells are 'smart' cells which do a large amount of processing
internally.

There are also some non-wire parameters to these cells, which also contribute to the
contents of an individual instance. The systransistor cell has a boolean parameter which determines
whether the transistor is implanted or not. The pullup ratio is a parameter to the syspullup cell.
The syscontact cell has a boolean parameter which determines whether all the wires are to be
electrically connected at the contact, or only the wires on independent layers are to be connected
to each other.

The reason for making these primitive cells general, and thus having fewer of them,
is to keep the number of technology dependent features of the language small. However, the
user can define simpler versions of these cells to facilitate their repeated invocation. As
mentioned earlier, all the technology dependent features of ALI2 are hidden inside the design
rules table, the primitive cells, and a few reserved identifiers. Even in the design rules table only
the separation and width rules are stored, because the other design rules are enforced inside the
primitive cells. ALI2 currently supports only nMOS primitive cells. Design of cells for other
technologies is currently under investigation.

3.3. Placement

Placement  is  specified  implicitly  by  create  statements,  or  explicitly  by  the  ordered  and  the 
separate  statements.  These  statements  are  used  to  relatively  place  the  various  objects  (wires  and 
bounding  boxes)  in  the  layout. 

The ordered statement is given a direction of separation and a list of creations of objects,
and its effect is to place the created objects in the given order, separated in that direction.

    ordered ltor do
    begin
        < bounding box 1 >
        < bounding box 2 >
        ordered ttob do
        begin
            < bounding box 3 >
            < bounding box 4 >
        end;
        < bounding box 5 >
    end

Fig. 4 - ordered statement

The actual objects that are ordered within an ordered statement are really bounding boxes.
Each ordered statement or cell create statement is associated with a rectangular bounding box.
The bounding box created for an ordered statement will enclose the bounding boxes created for
the statements within its scope, and in addition these bounding boxes will be separated in the
given direction.
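The enclosure and separation rules just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ALI2 implementation: the tuple encoding of a nested ordered statement, the function name `ordered_constraints`, and the symbolic constraint tuples are all hypothetical.

```python
def ordered_constraints(node, constraints, counter):
    """Walk a nested 'ordered' tree and collect symbolic placement constraints.

    node is either a leaf bounding-box name (str) or a pair
    (direction, children), a hypothetical encoding of an ordered statement.
    Returns the name of the bounding box that represents `node`.
    """
    if isinstance(node, str):
        return node
    direction, children = node
    counter[0] += 1
    name = f"ordered{counter[0]}"  # bounding box of this ordered statement
    boxes = [ordered_constraints(c, constraints, counter) for c in children]
    # The box of the ordered statement encloses every box in its scope.
    for child in boxes:
        constraints.append(("inside", child, name))
    # Consecutive boxes are separated in the given direction, in creation order.
    for first, second in zip(boxes, boxes[1:]):
        constraints.append((direction, first, second))
    return name


# The structure of Fig. 4, in the assumed encoding.
fig4 = ("ltor", ["box1", "box2", ("ttob", ["box3", "box4"]), "box5"])
collected = []
root = ordered_constraints(fig4, collected, [0])
```

Each tuple in `collected` stands for a relation that would later become one or more linear constraints; the nested statement appears to its parent as a single box, which is what makes the floor-plan reading possible.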

Since ALI2 is an extension of Pascal, repetition statements of Pascal can be used within an
ordered statement to create a succession of objects that are separated as specified.

The ordered statement matches quite well with the notion of floor-plans of layouts. Once
the ALI2 user has a rough sketch of the floor-plan of his layout, he can quickly translate the
sketch into a series of nested ordered statements. He can then refine each of his regions in the
floor-plan in a similar manner.

Both the cell structure and the ordered statement contribute to the hierarchy in the layout
description. However, there is a fundamental difference in the hierarchies created by the cells and
the ordered statement: wires cannot straddle the bounding box of a cell, but the same is not true
for an ordered statement. Thus, wires are subject only to the hierarchy defined by the cell
boundaries. The combination of the strict hierarchy of the cell structure and the lenient hierarchy
of the ordered statement seems to give the ALI2 user the right mixture of rigidity and flexibility
that he needs.

The other placement statement - the separate statement - is used to separate a given list
of bounding boxes and wires in a given direction of separation. Unlike the ordered statement, the
separate statement is not a structured statement. Its analogy in programming languages is the go
to statement. An ALI2 program can be written without using the separate statement, but it may
be used to make small local changes in the layout to avoid rewriting major portions of the ALI2
program.


4. Layout Issues Addressed in ALI2

A sample of the main issues that we tried to address with ALI2 is the following:

• The creation of an open-ended tool. Most layout design tools require the specification of
absolute sizes and positions, thus making the creation of a general-purpose library of cells a hard
task, since information about the sizes and positions of the cell elements that can interact with
the outside world has to be apparent to the user of the library. The absence of absolute sizes
and positions makes this problem much less severe in ALI2. ALI2 has been built on top of Pascal,
and is a full-fledged programming language having all the powers of Pascal, thereby making
it easily extensible. The generation of tools to automate the layout process, such as simple
routers or PLA generators, involves writing Pascal routines to solve some abstract version of the
problem and, having done so, invoking ALI2 cells to generate the layouts.

• Facilitating the division of labor. Large layouts have to be produced by more than one
designer. If the piece produced by each designer is specified in absolute positions, serious
problems are likely to arise when the different pieces are put together. ALI2 allows the partitioning
of tasks in such a way that the designer of a piece of the layout does not need to know anything
about the positions or sizes of other pieces of the complete layout.

• Facilitating hierarchical design. In ALI2, the information about a given level of the hierarchy
needed at the level immediately above is reduced, by the absence of absolute sizes and positions,
to topological relations among the layout elements of the lower level visible to the higher one.

• Facilitating easy update of layouts. Successful designs seem to be more or less continuously
updated as improved processes become available during their lifetime. Therefore, layout tools
must be easily amenable to changes in the technology or design rules. The technology-dependent
part of ALI2 is confined to a few design rules tables and primitive cells, and only these
have to be rewritten in order to update ALI2 to a new technology. Future versions of ALI2 will
give the user the flexibility of writing one ALI2 program to describe a layout, and then producing
different layouts for different processes by just setting certain appropriate flags when invoking
the ALI2 system.

• Allowing parametric design. Having a layout design which produces different layouts for
different values of a set of parameters is extremely useful. This is especially true for cell designs
which are used repeatedly. These parameters allow decisions about the detailed characteristics
of the cell in a layout to be delayed until later in the design phase. In ALI2, the cell
mechanism has been designed so that the number as well as the attributes of the wires connecting
to a cell can be parameters of the cell. In addition, the cells can have other parameters that
affect the insides of the cell. ALI2 offers all the wealth of a full-fledged programming language,
such as do-loops, conditional statements etc., which can be used to exploit the availability of
these parameters.

• Allowing easy modification of layouts. The fact that absolute sizes and positions are absent in
an ALI2 specification makes modification of a layout a very simple task. Such modifications are
actually being made to a program, which is a much easier task compared to making changes in
the final layout.

5. The ALI2 System

The ALI2 system takes as input an ALI2 program, with precompiled cells or rigid cells,
and produces the layout in CIF (Caltech Intermediate Form) code, or alternatively a precompiled
cell or a rigid cell, and connectivity information for simulation. There is a switch-level simulator,
described in [11]. The CIF code is then used to interface with other CAD tools, like the Berkeley VLSI
Tools [9]. There is also a program that takes CIF code and transforms it into a rigid cell, to be
used by any ALI2 program. Also, the node information for simulation can be obtained from the
CIF code.

Fig. 5 - The ALI2 System

There are 6 steps in going from the text of an ALI2 program to a layout in CIF:

1- Translation. The ALI2 program is translated into Pascal.

2- Compilation. The Pascal program is compiled, producing an object file.

3- Loading. The object file generated by the previous step plus several other standard object
modules are made into a single executable file.

4- Execution. The executable file is executed, producing a file of linear constraints, and
optionally connectivity information.

5- Solving. The set of linear constraints is run through the solver program, and an internal
representation of the layout is produced.

6- Generating CIF. The internal representation (in lambda units) is converted to CIF
(centimicron units).

The whole system is implemented under Berkeley UNIX, and the system is very efficient.
The translator was written using YACC. The compiler is the Berkeley Pascal Compiler. Execution
does not take much time, since its basic operation is to write down constraints every time a cell
is instantiated. The solver takes linear time relative to the number of constraints. CIF generation
is straightforward. So, what takes most of the time is read/write operations, especially for large
layouts.

6. Example

One of the chips designed using ALI2 was an n-bit parallel adder, and it is being sent for
fabrication. The parallel algorithm used for addition was borrowed from [13].


Fig.  •  -  A  e-blt  adder 


This  design  illustrates  the  utility  of  several  features  of  ALI2: 

1- General purpose cells such as the aarray cell [8], that was used to generate Weinberger type
cells, can be written and used very effectively.

2- It is easy to parametrize cells.

3- ALI2 has the power of a conventional programming language such as recursion, iterative
statements and functions.

4- It is quite simple to divide a layout task among several designers.

5- An ALI2 program serves as good documentation of the design of the layout.

ABSTRACT

The problem of VLSI layout compaction is often
reduced to finding optimal solutions to systems of simple
linear inequalities and equalities. The commonly
used algorithms take only linear time and space by the
usual worst case complexity measures, but serious
problems of page thrashing often occur when the algorithms
are run on systems with large sets of constraints.
Page faults must be taken into account if the
performance of such algorithms is to be predicted
realistically.

In this paper, we first discuss page-fault complexity
in the setting of paged dags. We then extend the discussion
to the case of constraint systems that are
hierarchically organized. We present algorithms that
find optimal solutions to hierarchical constraint systems
with strict bounds on the number of page-faults.
These algorithms also run in linear time and space by
the usual complexity measures.


1.  Introduction 

As VLSI logical design problems get more and more
complex, there is a trend toward hierarchical design
methodology. Several languages (e.g. ALI [LSV], CLAY
[N], HILL [LV], SLIM [D]) have been developed for layout
specification in which the relative geometric relations
and interconnections among the geometric objects in a
layout are specified instead of the absolute positions of
these objects, and the layout can be specified in a
hierarchical way. Usually a layout specification is
stated formally as follows:

(1) SLI Compaction. Given a set of simple linear inequalities
$\{ x_i + d_{ij} \le x_j \}$, find a solution such that

$\max(x_i) - \min(x_i)$ is minimized.

In the case where the set of constraints consists of simple
linear inequalities plus simple equalities, the problem
can be stated as follows:

(2) SLIE Compaction. Given a set of simple linear inequalities
$\{ x_i + d_{ij} \le x_j \}$ and a set of simple equalities
$\{ x_i = x_j \}$, find a solution such that

$\max(x_i) - \min(x_i)$ is minimized.

We call a set of simple linear inequalities an SLI
system, and a set of simple linear inequalities and simple
equalities an SLIE system.

Both problems have efficient algorithms in terms of
the usual time and space complexity measures. For SLI
compaction, the well-known PERT algorithm runs in
linear time and space. For SLIE compaction, we make a
substitution of variables to reduce the problem to compaction
for an SLI system. However, when an SLI or
SLIE system generated from a VLSI layout specification
is too large to fit into the working space of a computer,
the system is often partitioned and stored in several
pages. Page-thrashing in execution of the PERT algorithm
then becomes a serious problem, often dominating
the rest of the computation. Experiments indicate
that as the size of the constraint set grows,
the problem of page-thrashing becomes more
and more significant. Therefore, we must take page-faults
into account in any meaningful measure of the
complexity of algorithms for these problems.
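The substitution of variables mentioned above can be sketched as follows. This is a minimal illustration, assuming that a simple equality has the form x_i = x_j and that a constraint x_i + d <= x_j is encoded as a triple (i, d, j); the function name `slie_to_sli` and the use of a union-find structure are choices made for this sketch, not details taken from the paper.

```python
def slie_to_sli(inequalities, equalities):
    """Reduce an SLIE system to an SLI system by merging equal variables.

    inequalities: list of (i, d, j) meaning x_i + d <= x_j  (assumed encoding)
    equalities:   list of (i, j)    meaning x_i = x_j       (assumed encoding)
    Returns the rewritten inequalities and a function mapping each variable
    to the representative that replaces it.
    """
    parent = {}

    def find(v):
        # Union-find lookup with path halving.
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in equalities:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # substitute ra for every variable equal to rb

    reduced = [(find(i), d, find(j)) for i, d, j in inequalities]
    return reduced, find


reduced, representative = slie_to_sli(
    [("a", 2, "b"), ("c", 1, "d")],  # a + 2 <= b, c + 1 <= d
    [("b", "c")],                    # b = c
)
```

After the merge only inequalities remain, so any solver for SLI systems can be applied; each original variable takes the value computed for its representative.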

One way to avoid page thrashing is to find algorithms
that are efficient in the length of the layout
specification, which is usually much shorter than that of
the completely generated constraint set. Lengauer [L]
showed that this is possible in some, but not all, cases.
In this paper, we adopt a different approach. We
assume that the generated constraints are stored explicitly
in the secondary memory, which is divided into
pages of fixed size. We will show that when the constraint
systems are hierarchically organized, page-swapping
can be controlled in such a way that the
number of page-faults is strictly bounded. This is practical
since we can often organize the generated constraints
in a way that reflects the hierarchical
specification of a VLSI circuit.

In Section 2, we define hierarchical SLI and SLIE
systems. In Section 3, we discuss page-fault complexity
in the general setting of paged dags. In Section 4, the
basic ideas presented in Section 3 are applied to
hierarchical SLI and SLIE systems. We present new
algorithms that find optimal solutions to hierarchical
SLI and SLIE systems in a hierarchical way, with strict
bounds on the number of page-faults. These algorithms
run in linear time and space, which is also best possible
by the usual time and space complexity measures.

2. Hierarchical SLI and SLIE Systems

We assume that the simple linear inequalities and
simple equalities in an SLI or SLIE system are stored in
the following way: for each variable $x_i$, there is a list of
tuples $(x_j, d_{ij})$ where $x_i + d_{ij} \le x_j$ is a constraint, a list of
tuples $(x_k, d_{ki})$ where $x_k + d_{ki} \le x_i$ is a constraint, and a
list of elements $x_j$ where $x_i = x_j$ is a constraint. We
assume throughout that $d_{ij} > 0$, and the constraints are
acyclic.

We also assume that the storage structure consists
of a fast memory, which we call the main memory (or
the working space), and a slower memory, which we call
the secondary memory. The secondary memory is partitioned
into pages where each page has a fixed amount
of space. Suppose now a set of constraints is stored in
several pages and each page stores a disjoint subset of
variables with their adjacency lists. We may think of
each page as representing a subset of variables, and the
union of these subsets is the whole set of variables. We
call an SLI or SLIE system stored in this way a paged SLI
or SLIE system.

A paged SLI or SLIE system can be organized
hierarchically. More formally, let $V$ be the set of variables
of an SLIE or SLI system. Let $P = \{ V_1, \ldots, V_n \}$
be a partition of $V$. We call each subset $V_i$ of $V$ a block
at level 1. A block at level 1 corresponds to a page.
Now we can similarly partition $P$ into subsets of blocks
at level 1 and call each subset a block at level 2. A
block at level 2 thus contains several blocks at level 1
as members. This process can be continued to higher
levels. We stop when we get to a block that contains all
of $V$, and such a hierarchy can be represented by a
tree. The root is the block at the highest level, and the
leaves are the blocks at level 1, which correspond to the
pages. When a paged SLIE (or SLI) system is hierarchically
organized, we call it a hierarchical SLIE (or SLI)
system.


Definition. Suppose blocks $V_1$, $V_2$ are both members of
block $V$. When there is a constraint relating a variable
$x_1$ in $V_1$ and a variable $x_2$ in $V_2$, we say that $x_1$ and $x_2$
are outer-variables of $V_1$ and $V_2$ respectively. ∎

Note that the outer-variables of a block are the
variables that interact with variables of other blocks at
the same level. Usually, the number of outer-variables
is small compared to the total number of variables.
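As a small illustration of this definition, the outer-variables of each block can be collected in one pass over the constraints. The sketch below assumes a single level of blocks, a constraint encoded as a triple (i, d, j) for x_i + d <= x_j, and a dictionary `block_of` giving the block of each variable; these names and encodings are hypothetical.

```python
def outer_variables(constraints, block_of):
    """Map each block to the set of its outer-variables.

    constraints: list of (i, d, j) meaning x_i + d <= x_j  (assumed encoding)
    block_of:    dict from variable name to block name     (assumed encoding)
    """
    outer = {}
    for i, _d, j in constraints:
        bi, bj = block_of[i], block_of[j]
        if bi != bj:
            # A constraint crossing a block boundary makes both ends outer.
            outer.setdefault(bi, set()).add(i)
            outer.setdefault(bj, set()).add(j)
    return outer


example = outer_variables(
    [("a0", 1, "a1"), ("a1", 1, "a2"), ("a0", 1, "b0"), ("b0", 1, "b1"), ("b1", 1, "a2")],
    {"a0": "V1", "a1": "V1", "a2": "V1", "b0": "V2", "b1": "V2"},
)
```

In this toy system `a1` is the only variable that never appears in a cross-block constraint, so it is the only one that is not an outer-variable.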

Before going further, we present a slightly modified
version of the PERT algorithm and point out how page-thrashing
may occur. Let us extend slightly the
definition of an SLI system to include, besides a set of
simple linear inequalities, a set of constraints
$\{ x_i \ge x_{i0} \mid x_i \text{ is a variable},\ x_{i0} \text{ is a constant} \}$. We call such
an extended system a preconditioned SLI system. The
following modified version of the algorithm PERT solves
the compaction problem for a preconditioned SLI system
in a way similar to topological sorting [K].

Algorithm SIMPLE PERT

Input: a set of simple linear inequalities together with a
set { x_i >= x_i0 | x_i is a variable, x_i0 is a constant }
Output: { t(x_i) | x_i is a variable } (comment: an optimal
solution to the preconditioned SLI system, where t(x) is
the value for variable x)
begin
    t(x_i) := 0, p(x_i) := x_i0, for all i;
    in(x_i) := | { x_k | x_k + d_ki <= x_i is a constraint } |;
    S := empty queue;
    (comment: initialization)
    Find all x_i where in(x_i) = 0, put x_i in S;
    while S is not empty do begin
        Pop a variable x_i from S;
        t(x_i) := p(x_i);
        for x_j such that x_i + d_ij <= x_j is a constraint do begin
            p(x_j) := max( p(x_j), t(x_i) + d_ij );
            in(x_j) := in(x_j) - 1;
            if in(x_j) = 0, then put x_j in S
        end
    end
end.
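For readers who prefer executable code, the algorithm can be transcribed into Python roughly as follows. This is a sketch, not the authors' implementation: the triple encoding (i, d, j) for x_i + d <= x_j, the dictionary `lower` holding the constants x_i0, and the default lower bound of 0 for variables without one are assumptions of this example.

```python
from collections import deque


def simple_pert(variables, constraints, lower=None):
    """Smallest values t(x) satisfying x_i + d <= x_j and x_i >= lower[x_i].

    variables:   iterable of variable names
    constraints: list of (i, d, j) meaning x_i + d <= x_j  (assumed encoding)
    lower:       dict of lower bounds x_i0; missing entries default to 0
    """
    variables = list(variables)
    lower = lower or {}
    succ = {v: [] for v in variables}
    indeg = {v: 0 for v in variables}      # in(x) in the pseudocode
    for i, d, j in constraints:
        succ[i].append((j, d))
        indeg[j] += 1

    p = {v: lower.get(v, 0) for v in variables}  # tentative values p(x)
    t = {}                                       # final values t(x)
    queue = deque(v for v in variables if indeg[v] == 0)
    while queue:
        i = queue.popleft()
        t[i] = p[i]
        for j, d in succ[i]:
            p[j] = max(p[j], t[i] + d)
            indeg[j] -= 1
            if indeg[j] == 0:
                queue.append(j)

    if len(t) != len(variables):
        raise ValueError("constraints are not acyclic")
    return t


solution = simple_pert(
    ["a", "b", "c"],
    [("a", 2, "b"), ("b", 3, "c"), ("a", 1, "c")],
    lower={"a": 1},
)
```

Each constraint is examined once and each variable enters the queue once, which is where the linear time bound comes from; the sketch says nothing about which page a variable lives on, and that is exactly the cost the rest of the section is concerned with.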

Referring to SIMPLE PERT, we see that whenever
a variable in a page different from that of the current $x_i$
is referenced, a page fault occurs. In practice, for large
problems, this can happen quite often, as illustrated by
the following simple example of an SLI system. This
example also motivates the basic ideas that will be used
later on.

Example 2.1. The set of variables
$\{ x_{ij} \mid 1 \le i \le 3,\ 0 \le j \le n \}$ is partitioned into 3 disjoint
blocks:

$V_1 = \{ x_{10}, x_{11}, \ldots, x_{1n} \}$

$V_2 = \{ x_{20}, x_{21}, \ldots, x_{2n} \}$

$V_3 = \{ x_{30}, x_{31}, \ldots, x_{3n} \}$

We assume that each block is in one page. The constraints
are the following, where $d$ is a positive integer:

$x_{ij} + d \le x_{i(j+1)}$, $i = 1, 2, 3$; $j = 1, \ldots, (n-1)$.

*io*W**s»- 

*w+rf  sin 

Representing a constraint $x + d \le x'$ by $x \to x'$, we
obtain a dag $G$ as shown in Figure 2.1. In Figure 2.1, the
variables are arranged so that the rows correspond to
the blocks, and the columns represent the successive
configurations of the queue $S$ when SIMPLE PERT is
applied. From the picture, we see that $3n$ page-faults
occur when SIMPLE PERT is applied. Although this
example exhibits bad behavior using a FIFO queue, similar
examples can be contrived for LIFO and other
list-management disciplines.

Now we examine the example more closely and
show how page-faults can be reduced. The interaction
among  the  blocks  can  be  represented  by  a  dag  H  on 
the  outer-variables  x1D.  z{l,  i  s  1,  2.  3.  as  shown  in  Fig¬ 
ure  2  2.  The  dag  H  actually  represents  the  dependency 
relation among the outer-variables. In $H$, there is an
arc from an outer-variable to another if the latter is
reachable from the former in $G$ without passing through
any other outer-variable. If an outer-variable has no
predecessor in $H$, then it does not depend on any variable
external to the block it is in. Therefore we can
compute the value for it if the block it belongs to is
fetched to the working space. After the value of an
outer-variable is computed, we delete it from $H$. It
then holds inductively that at any point of the algorithm,
an outer-variable with no predecessors in the
remaining part of dag $H$ is one whose predecessors in $G$
that are external to the block it belongs to are all computed.
Therefore we can compute such a variable if the
block it belongs to is fetched next.

In  the  beginning,  since  z}0  has  no  predecessor  in 
H .  we  can  fetch  V,  and  compute  the  value  for  Zj0  After 
we  compute  Z]Q.  we  can  compute  xn  since  Xjo-  its  only 
predecessor,  is  computed.  However.  xxt  cannot  be 
computed  since  it  depends  on  Xjc  which  is  external  to 
Vv  Since  the  value  of  xic  is  computed,  we  delete  it 
from  H  (Fig.  2.3).  Now  that  z,0.  the  only  predecessor  to 
the  outer -variables  of  iy  is  computed,  we  can  compute 

•U  the  variables  in  Vt  in  the  order  xt, . x**.  Now 

we  delete  xK  and  xn  from  H  (Fig.  2  4).  then  we  see  that 
we  can  fetch  V%  and  compute  Xso>  *si . *sn  since  all 


the  predecessors  of  zK  and  xm  that  are  axtemal  to  Va 
are  already  determined.  Finally,  iK  and  x0  are 
deleted  from  H  (Fig.  2.5).  we  fetch  and  the  values  for 

Xjc . x ln  can  be  computed.  In  this  way,  only  4  page- 

faults  occur.  • 

We call the dags on the outer-variables constructed
in Example 2.1 the outer-dags associated with the SLI
systems. The example illustrates the fact that the
outer-dags contain information that is useful for
arranging the page-fetching to reduce the number of
page-faults.

In the algorithms we will present later on, algorithm
SIMPLE PERT will only be used locally within a
page. We observe that algorithm SIMPLE PERT is
essentially topological sort on the variables with
respect to the partial order induced by the inequality
constraints. In the hierarchical situation this approach
will be extended (1) to exploit useful partial orders on
the outer-variables or interesting sets of outer-variables
that are induced by the constraints; (2) to find efficient
methods for computing such partial orders.


3.  Page-fault  Complexity  for  Computational  Dags 

In this section, we will discuss page-fault complexity
in the setting of paged computational dags, which
correspond in a natural way to paged SLI systems. We
consider a general computational problem that can be
characterized by a dag $G = (V, E)$. A node represents a
computational step and the arc set $E$ represents the
dependency relation on the computational steps. That
is, if $(u, v) \in E$, then the computation $v$ cannot be done
before $u$. We suppose $V$ is partitioned into subsets
$V_1, \ldots, V_p$, and call each subset a block. We may think of
a block with the adjacency lists of its nodes being
stored in a page, and therefore we call a dag $G = (V, E)$
with a partition $P = \{ V_1, \ldots, V_p \}$ a
paged computational dag, or simply paged dag. If
$(u, u') \in E$ with $u \in V_i$, $u' \in V_j$, and $i \ne j$, then we say $u$
is an outer-node of $V_i$, and $u'$ is an outer-node of $V_j$.

In the sequel, we will always use $*$ to denote transitive
closure.

Definition. For $u, v \in V$, we say $u \to v$ when $(u, v) \in E$. So
$u \to^* v$ if and only if $v$ is reachable from $u$. For outer-nodes
$u, u'$, we say $u <_o u'$ when $u'$ is reachable from $u$
without passing through any other outer-node. ∎

Definition. Let $U$ be the set of outer-nodes. We define a
directed graph $H$ on $U$ as follows: $H = (U, E_o)$, and
$E_o = \{ (u, v) \mid u <_o v \text{ and } u, v \in U \}$. It is clear that $H$ is a
dag. We call $H$ the outer-dag associated with $G$. ∎

Definition. Let $G = (V, E)$, $P = \{ V_1, \ldots, V_p \}$ be as
described before. Let $\bar{G} = (P, \bar{E})$ where
$\bar{E} = \{ (V_i, V_j) \mid \text{there are outer-nodes } u \in V_i,\ v \in V_j,\ (u, v) \in E \}$.
We call $\bar{G}$ the super-graph associated with $G$
and $P$. When $\bar{G}$ is a dag, we call it a super-dag. ∎
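The outer-dag defined above can be computed by a search that starts at each outer-node and stops as soon as it meets another outer-node. The following Python sketch shows one way to do this; the arc-list and `block_of` encodings and the function name `outer_dag` are assumptions of this example, and no attempt is made here to control page-faults during the construction.

```python
def outer_dag(arcs, block_of):
    """Return (outer_nodes, h_arcs) for a paged dag.

    arcs:     iterable of (u, v) pairs, the arc set E   (assumed encoding)
    block_of: dict from node to block                   (assumed encoding)
    h_arcs contains (u, v) when v is reachable from u without passing
    through any other outer-node.
    """
    arcs = list(arcs)
    succ = {}
    for u, v in arcs:
        succ.setdefault(u, []).append(v)

    # A node is an outer-node if some arc joins it to a node of another block.
    outer = set()
    for u, v in arcs:
        if block_of[u] != block_of[v]:
            outer.update((u, v))

    h_arcs = set()
    for u in outer:
        stack = list(succ.get(u, []))
        seen = set()
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            if v in outer:
                h_arcs.add((u, v))               # stop at the first outer-node
            else:
                stack.extend(succ.get(v, []))    # pass through inner nodes
    return outer, h_arcs


blocks = {"a0": 1, "a1": 1, "a2": 1, "b0": 2, "b1": 2}
G_arcs = [("a0", "a1"), ("a1", "a2"), ("a0", "b0"), ("b0", "b1"), ("b1", "a2")]
U, E_o = outer_dag(G_arcs, blocks)
```

In the toy dag, `a1` is an inner node of block 1, so the path a0, a1, a2 contributes the single arc (a0, a2) to the outer-dag.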

In particular, for an SLI system, the corresponding
computational dag is $G = (V, E)$ where $V$ is the set of
variables, a node represents the determination of the
value of the variable in it, and $E = \{ (x_i, x_j) \mid \text{there is a constraint } x_i + d_{ij} \le x_j \}$.
In this case, a paged dag is just
a paged SLI system, and a block corresponds to a page
of the SLI system. The outer-nodes are just the outer-variables
in $V$. For variables $x_i$, $x_j$, $x_i \to x_j$ if and only if
there is a constraint $x_i + d_{ij} \le x_j$. For outer-variables
$x_i$, $x_j$, $x_i <_o x_j$ if and only if $x_i \to^* x_j$ and there is no
outer-variable $x_k$ such that $x_i \to^* x_k$ and $x_k \to^* x_j$. Note
that for every paged dag $G = (V, E)$ with partition $P$
there is a paged SLI system whose corresponding paged
dag can be represented by $G$ and $P$.

Definition. Let $G = (V, E)$ be a dag. For $v \in V$, define
$l(v) = 1$ if $v$ has no predecessor, else
$l(v) = \max \{ l(u) \mid u \text{ is a predecessor of } v \text{ in } G \} + 1$.
Call $l(v)$ the level of $v$ in $G$. ∎
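The level function can be evaluated directly from its recursive definition. The short Python sketch below memoizes the recursion; the node-list and arc-list encodings and the function name `levels` are assumptions of this example, and plain recursion is only suitable for dags of modest depth.

```python
def levels(nodes, arcs):
    """Level l(v) of every node: 1 for nodes without predecessors,
    otherwise 1 + the maximum level among the predecessors."""
    pred = {v: [] for v in nodes}
    for u, v in arcs:
        pred[v].append(u)

    level = {}

    def l(v):
        if v not in level:
            # max(..., default=0) covers the base case of no predecessor.
            level[v] = 1 + max((l(u) for u in pred[v]), default=0)
        return level[v]

    for v in nodes:
        l(v)
    return level


lv = levels(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("a", "c")])
```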

Definition. Let $G = (V, E)$ be a dag. A node $v$ in $G$ is said
to be exposed if and only if indegree($v$) = 0, that is, $v$
has no predecessor. ∎

Given a dag $G = (V, E)$ and a subset $W$ of $V$, the following
procedure recursively deletes the exposed nodes
which belong to $W$. We will use this procedure later on.

Procedure DELETE(G, W)

(comment: G = (V, E) is a graph, W is a subset of V)
begin
    while there is a node v in W that is exposed do
    begin
        G := G - v;
        W := W - v
    end
end ∎
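A direct Python rendering of procedure DELETE is given below as a sketch. The set-based graph encoding, the function name `delete_exposed`, and the choice to return the deletion order are assumptions of this example; the rescan of the arc set on every iteration keeps the code short and is not meant to reflect the linear-time bookkeeping a real implementation would use.

```python
def delete_exposed(nodes, arcs, w):
    """Repeatedly delete from the graph the nodes of `w` that are exposed.

    nodes: set of nodes, arcs: set of (u, v) pairs, w: subset of nodes.
    Returns (remaining_nodes, remaining_arcs, deleted_in_order).
    """
    nodes, arcs, w = set(nodes), set(arcs), set(w)
    deleted = []
    while True:
        has_pred = {v for _, v in arcs}
        exposed = sorted(v for v in w if v not in has_pred)
        if not exposed:
            break
        v = exposed[0]
        nodes.discard(v)                                     # G := G - v
        arcs = {(a, b) for a, b in arcs if a != v and b != v}
        w.discard(v)                                         # W := W - v
        deleted.append(v)
    return nodes, arcs, deleted


# A toy outer-dag: a0 and a2 belong to block 1, b0 and b1 to block 2.
H_nodes = {"a0", "a2", "b0", "b1"}
H_arcs = {("a0", "a2"), ("a0", "b0"), ("b0", "b1"), ("b1", "a2")}
n1, e1, first = delete_exposed(H_nodes, H_arcs, {"a0", "a2"})   # fetch block 1
n2, e2, second = delete_exposed(n1, e1, {"b0", "b1"})           # fetch block 2
n3, e3, third = delete_exposed(n2, e2, {"a2"})                  # fetch block 1 again
```

The three calls mimic fetching block 1, block 2, and block 1 again: only `a0` can be removed on the first visit because `a2` still waits for `b1`.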

In carrying out the computational steps of a paged
dag $G = (V, E)$, we assume that only one block is allowed
to reside in the working space at one time. When a
block is fetched to the working space, some nodes in
the block can be computed. When it is not possible to
proceed any further, a new block must be fetched into
the working space and the original block stored back in
the secondary memory, in which case a page-fault
occurs.

Let $\langle V_{k_i} \rangle$, $i = 1, 2, \ldots$, be the sequence of page-faults
that occur in the course of a computation. That
is, when the $i$-th page-fault occurs, block $V_{k_i}$ is fetched
to the working space, and $V_{k_i} \ne V_{k_{i+1}}$. The number of
page-faults is simply the length of the sequence. We
call a sequence of blocks a page sequence, and call a page
sequence that corresponds to a full computation of the
paged dag a legal page sequence.

Suppose that an outer-node is deleted from the
outer-dag once it is computed. Then an outer-node
becomes exposed exactly when all its precedent outer-nodes
are computed and deleted. Now let $H$ be the
current outer-dag with all the computed outer-nodes
deleted at the time the $i$-th page-fault occurs, and let
$U_{k_i}$ be the outer-nodes of $V_{k_i}$ (the block fetched at the
$i$-th page-fault). When the $i$-th page-fault
occurs and $V_{k_i}$ is fetched to the working space, the
exposed outer-nodes in $U_{k_i}$ have all their precedent
outer-nodes already computed. Therefore they can be
computed and deleted before the $(i+1)$-st page-fault
occurs. After they are computed and deleted, there
may be new outer-nodes in $U_{k_i}$ that become exposed.
Similarly, they can also be computed and deleted from
$H$ before the next page-fault occurs. Inductively, we
see that the outer-nodes in $U_{k_i}$ that can be computed
and deleted are exactly those that would be deleted
when procedure DELETE is applied to $H$ and $U_{k_i}$.
Therefore, the following is true.

Lemma 3.1. Let $\langle V_{k_i} \rangle$ be a page sequence. Let $U_{k_i}$ be
the outer-nodes of the block $V_{k_i}$. Let $H$ be the outer-dag
with all the computed outer-nodes deleted at the
time the $i$-th page-fault occurs. Then (1) between the
$i$-th and $(i+1)$-st page-faults, the outer-nodes of block
$V_{k_i}$ that can be computed are exactly those that are
deleted when procedure DELETE is applied to $H$ and $U_{k_i}$;
(2) between the $i$-th and $(i+1)$-st page-faults, the
nodes $v$ in $V_{k_i}$ that can be computed are exactly those
$v$ in $V_{k_i}$ whose precedent outer-nodes $u$, where $u \to^* v$,
are deleted before the $(i+1)$-st page-fault occurs. ∎

Lemma 3.1 shows that the computational effect
between two page-faults can be described by a process
that recursively deletes the exposed outer-nodes of a
block from the outer-dag. To illustrate this idea, let us
look at Example 3.1.

Example 3.1. A paged dag $G$ and its associated outer-dag
$H$ are depicted in Figure 3.1, where the $x_{1j}$'s are in
block $V_1$ and the $x_{2j}$'s are in block $V_2$.

A legal page sequence is $V_1$, $V_2$, $V_1$. Let us examine
the action taken with each page-fault.

$V_1$ is fetched to the working space (see Figure 3.2):

• node $x_{10}$ is deleted from $H$, then x„.

• the remaining part of $G$ when the deletion process
is applied is shown in Figure 3.2.

$V_2$ is fetched to the working space (see Figure 3.3):

• node xn is deleted from $H$, then »u.

• all nodes in $V_2$ will be deleted.

Finally, $V_1$ is fetched to the working space:

• node $x_{14}$ is deleted.

• all nodes in $V_1$ will be deleted. ∎

Ideally, given a paged dag, we would like to minimize
the number of page-faults for a full computation of
the dag. That is, we would like to find a legal page
sequence that has minimum length for the paged dag.
Unfortunately, this problem is NP-complete. Namely,
we have the following result:

Theorem 3.1. The following problem is NP-complete:
given a paged dag and an integer $k$, are $k$ page-faults
sufficient for computing the paged dag?

The proof of Theorem 3.1 can be found in [H]; it
uses a reduction from feedback vertex set. Theorem 3.1
implies that achieving the minimum possible number of
page-faults is practically infeasible, assuming of course
$P \ne NP$. However, we have the following result:

Theorem  3.2 

(1)  In  the  worst  case,  for  a  paged  dag  of  p  blocks  and  m 
outer-nodes  in  every  block,  at  least  mp  page-faults  are 
necessary  for  computing  the  paged  dag. 

(2)  There  is  an  algorithm  that  computes  a  paged  dag  of 
p  blocks  with  no  more  than  m  outer-nodes  in  every 
block  in  linear  time  and  space,  with  the  number  of 
page-faults  no  more  than  mp  • 

Theorem 3.2 will be established in several steps,
and an algorithm for (2) will actually be constructed.

We prove (1) first by constructing paged dags for
which the worst case occurs, given p, the number of
blocks, and m, the maximum number of outer-nodes in
a block. Choose positive integers a and b such that
$a + b = p$. We form a dag $H = (U, E_H)$ so that level $2i+1$
of H, $i = 0, \ldots, m-1$, consists of a nodes $u_{ki}$, $k = 1, \ldots, a$,
and level $2i$, $i = 1, \ldots, m$, consists of b nodes $u_{ki}$,
$k = a+1, \ldots, p$. For nodes $u, u'$ in H, $(u, u')$ is an arc if
and only if $l(u) = l(u') - 1$, where $l(u)$ denotes the level of u.
It is easy to construct a dag C with partition
$\{V_1, \ldots, V_p\}$ on V such that for $k = 1, \ldots, p$, the
$u_{kj}$'s are in $V_k$, and the associated outer-dag is the
same as H. By Lemma 3.1 and the way H is defined, any
legal page sequence for C has length $|U| = mp$. This
proves (1).


Now we turn our attention to (2), first finding a
legal page sequence.

Suppose $C = (V, E)$ is a paged dag with the associated
partition P on V. Suppose the associated outer-dag
$H = (U, E_H)$ has been determined; then the following
procedure will find a legal page sequence for C.

Procedure SEQUENCE

Input: H, the outer-dag associated with a paged dag C.
Output: a legal page sequence S.

begin

  S := empty sequence; H := the input outer-dag;
  (comment: initialization)

  while H is not empty do begin

    choose an exposed node u in H;

    if u is from block $V_i$, then apply procedure
    DELETE on H and the subset of outer-nodes of $V_i$;

    append $V_i$ to the end of the sequence S

  end
end •
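The procedure above can be sketched in Python. This is only an illustration under stated assumptions: the outer-dag is given as a successor map `succ` plus a map `block_of` from each outer-node to its block (both names are hypothetical), a node is taken to be "exposed" when it has no remaining predecessor, and `delete` stands in for procedure DELETE by repeatedly removing the exposed outer-nodes of the fetched block.

```python
from collections import defaultdict


def delete(succ, indeg, alive, block_nodes):
    """Assumed DELETE: repeatedly remove exposed nodes that lie in block_nodes."""
    stack = [u for u in block_nodes if u in alive and indeg[u] == 0]
    while stack:
        u = stack.pop()
        alive.discard(u)
        for w in succ.get(u, ()):
            indeg[w] -= 1
            # A successor becomes exposed; delete it now only if it is in the same block.
            if indeg[w] == 0 and w in block_nodes:
                stack.append(w)


def sequence(succ, block_of):
    """Return a legal page sequence (list of block names) for an acyclic outer-dag."""
    alive = set(block_of)
    indeg = {u: 0 for u in alive}
    for u, ws in succ.items():
        for w in ws:
            indeg[w] += 1
    blocks = defaultdict(set)
    for u, b in block_of.items():
        blocks[b].add(u)
    pages = []
    while alive:
        # Any exposed node will do; min() just makes the choice deterministic.
        u = min(v for v in alive if indeg[v] == 0)
        b = block_of[u]
        delete(succ, indeg, alive, blocks[b])
        pages.append(b)
    return pages


# Two blocks whose outer-nodes alternate along a chain: a0 -> b0 -> a1 -> b1.
succ = {"a0": ["b0"], "b0": ["a1"], "a1": ["b1"]}
block_of = {"a0": 1, "a1": 1, "b0": 2, "b1": 2}
pages = sequence(succ, block_of)
```

In this small example every fetch can delete only one outer-node, so the sequence has length mp with m = 2 and p = 2, which matches the worst case of Theorem 3.2 (1).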

Let m be an upper bound on the number of outer-
nodes a block can have and p be the number of blocks.
Then the length of the legal page sequence determined
by procedure SEQUENCE is at most mp.

Instead of actually constructing the outer-dag and
applying procedure SEQUENCE, we will describe an
algorithm for computing a computational dag that will
result in the same legal page sequence as is determined
by procedure SEQUENCE. The algorithm is based on
the argument that was used to prove Lemma 3.1.

Given a set of newly computed nodes in a block V,
the following procedure LOCAL will compute as many
nodes in V as possible. In this procedure as well as in
the algorithm we will describe, S(V) will denote a set of
newly computed nodes in V, and in(v) will denote the
in-degree of a node v.

Procedure LOCAL (S(V), V)

begin

  do until S(V) is empty begin
    pop a node v from S(V);
    for u such that (v, u) is an arc do begin
      in(u) := in(u) - 1;

      if in(u) = 0 then do begin

        if u ∈ V, then put u in S(V);

        if u ∈ $V_i$ for some other block $V_i$, then put u in
        S'($V_i$);

      end

    end

  end

end

Notice that S'($V_i$) is small since it is a subset of the
outer-nodes of $V_i$. We may assume that these S'($V_i$) are
kept in the main memory throughout the computation. The
following algorithm computes a computational dag C with the
same page sequence determined by procedure
SEQUENCE.

Algorithm COMPUTE DAG
begin

  for each block V do begin

    S(V) := { v ∈ V | in(v) = 0 };

    LOCAL (S(V), V);

  end;

  do until all S'(V) are empty begin

    for non-empty S'(V) do LOCAL (S'(V), V);

  end;

end
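A minimal Python sketch of LOCAL and COMPUTE DAG follows, counting one page-fault per call to LOCAL. It is an illustration rather than the authors' implementation: `succ` and `block_of` are hypothetical input names, `pending` plays the role of the in-memory sets S'(V), and, as a small simplification, nodes that became ready for a block before its first fetch are merged into that first call.

```python
from collections import defaultdict


def compute_dag(succ, block_of):
    """Compute all nodes of a paged dag; return (computation order, page fetches)."""
    indeg = {v: 0 for v in block_of}
    for v, ws in succ.items():
        for w in ws:
            indeg[w] += 1
    sources = defaultdict(list)           # initial S(V): nodes with in-degree 0
    for v in block_of:
        if indeg[v] == 0:
            sources[block_of[v]].append(v)
    pending = {}                          # S'(V): ready nodes waiting for block V
    order, fetches = [], []

    def local(ready, block):
        fetches.append(block)             # fetching the block costs one page-fault
        while ready:
            v = ready.pop()
            order.append(v)               # v is now computed
            for u in succ.get(v, ()):
                indeg[u] -= 1
                if indeg[u] == 0:
                    if block_of[u] == block:
                        ready.append(u)
                    else:
                        pending.setdefault(block_of[u], []).append(u)

    for block in sorted(set(block_of.values())):
        local(sources[block] + pending.pop(block, []), block)
    while pending:
        block = min(pending)              # any block with a non-empty S'(V)
        local(pending.pop(block), block)
    return order, fetches


succ = {"a0": ["b0"], "b0": ["a1"], "a1": ["b1"]}
block_of = {"a0": 1, "a1": 1, "b0": 2, "b1": 2}
order, fetches = compute_dag(succ, block_of)
```

Each arc is examined once, so the running time of this sketch is linear in the number of arcs, in line with the claim made for the algorithm.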

We can see by induction that between two page-
faults, procedure LOCAL computes the nodes which are
characterized by Lemma 3.1, and that the execution of
algorithm COMPUTE DAG will result in the same number
of page-faults as is determined by procedure
SEQUENCE, which is bounded by mp. Finally, the total
running time and space is linear in the number of arcs,
as is clear from the algorithm. •

From the above discussion, we see that the outer-
dags associated with the paged dags play an essential
role in studying the page-fault complexity. The ideas
presented in this section will be extended in the next
section to deal with the general hierarchical situation.
Our goal is to obtain an analog of (2) of Theorem 3.2 for
the general case of hierarchical SLIE systems, where
equality constraints are also involved.


4. Hierarchical Compaction for SLI and SLIE Systems

In this section, we will extend the ideas presented
in the last section to deal with hierarchical SLI and SLIE
systems. All the algorithms presented in this section
run in total linear time and space with a strict bound on
the number of page faults. In Section 4.1, we discuss an
important subclass of SLI systems that are characterized
by an acyclic property. In Section 4.2, we discuss
the general case of hierarchical SLIE systems.

4.1. Hierarchical Local Pert for Acyclic SLI Systems

First let us extend the definition of super-graph to
hierarchical SLI systems.

Definition Let V be a block in a hierarchical SLI system,
and let $P = \{v_1, \ldots, v_m\}$ be the set of member
blocks of V. Define the directed graph $G_V = (P, E_V)$,
where $E_V = \{ (v_i, v_j) \mid$ there are outer-variables
$x_i$ of $v_i$, $x_j$ of $v_j$ such that $x_i <_o x_j \}$. Call $G_V$
the super-graph associated with V. When $G_V$ is a dag,
call it a super-dag. •


The SLI systems we are interested in in this section
are those for which every block has an acyclic super-graph.
For such an SLI system, the associated super-graph
of every block V at any level, $G_V = (P, E_V)$, is a
dag, and therefore $E_V$ defines a partial order $<_V$ on V.
Namely, we put $v_i <_V v_j$ if and only if $(v_i, v_j) \in E_V$.

Let V, W be two blocks at the same level. If there
are $x \in V$, $x' \in W$ such that $x <_o x'$, then we say V is a
preceding block of W. We say a block is computed if all
the variables in it are computed.

For a block V at the bottom level, if all the preceding
blocks of V have been computed, then all outer-
variables x external to V such that $x <_o x'$ for some
$x' \in V$ have been computed. Therefore, V can be computed
next. We see that this holds inductively for
blocks at all levels. Based on this idea, the following
algorithm will determine the values for all the variables
of a block V by proceeding from one member block of V
to another in an order that is consistent with the partial
order $<_V$, assuming that the preceding blocks of block
V are all computed.

Algorithm (HIERARCHICAL LOCAL PERT)
Input: a set of simple linear inequalities on the variables
of a block V; $G_V = (P, E_V)$, the associated super-dag; a
set Pre(V) containing inequalities of the form $x \ge x_0$,
where x is an outer-variable of V and $x_0$ is a constant;
and similarly a set Pre($v_i$) containing inequalities
$x \ge x_0$, where x is an outer-variable of $v_i$, for each $v_i \in P$.
Output: { t(x) | x ∈ V } (comment: t(x) is the value of the
variable x in V)
begin

  for x ∈ V, t(x) := 0; p(x) := 0; (comment: initialization)

  if V is a bottom level block then

  begin

    p(x) := max { a | x ≥ a is in Pre(V) } for every outer-
    variable x of V;

    SIMPLE PERT on V;

  end

  if V is not a bottom level block then
  begin

    select the member blocks of V in an order that is
    consistent with the partial order on V; for each selected
    block $v_i$ do

    begin

      p(x) := max { a | x ≥ a is in Pre($v_i$) } for every
      outer-variable x of $v_i$;

      HIERARCHICAL LOCAL PERT on $v_i$;
      for $v_j$ such that $(v_i, v_j) \in E_V$ do

        Pre($v_j$) := Pre($v_j$) ∪ { $x_j \ge t(x) + d$ | $x + d \le x_j$
        is a constraint with $x \in v_i$, $x_j \in v_j$ };

    end

  end

end
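The block-by-block propagation of lower bounds can be illustrated with a deliberately simplified, two-level Python sketch. Several things here are assumptions rather than part of the algorithm as stated: SIMPLE PERT is taken to be a longest-path computation over the constraints inside one block, the hierarchy is flattened to one parent with bottom-level member blocks, and the names `simple_pert`, `blocks`, `inner`, `cross`, and `order` are hypothetical.

```python
from graphlib import TopologicalSorter


def simple_pert(variables, constraints, lower):
    """Assumed SIMPLE PERT: least t with t[x] >= lower[x] and t[a] + d <= t[b]."""
    preds = {x: set() for x in variables}
    for a, d, b in constraints:
        preds[b].add(a)
    t = {x: lower.get(x, 0) for x in variables}
    for x in TopologicalSorter(preds).static_order():
        for a, d, b in constraints:
            if b == x:
                t[x] = max(t[x], t[a] + d)
    return t


def hierarchical_local_pert(blocks, inner, cross, order):
    """blocks: name -> variables; inner: name -> constraints inside that block;
    cross: constraints (a, d, b) between different blocks, meaning a + d <= b;
    order: member blocks listed consistently with the super-dag."""
    owner = {x: name for name, xs in blocks.items() for x in xs}
    pre = {name: {} for name in blocks}   # Pre(v_j): lower bounds passed along
    t = {}
    for name in order:
        t.update(simple_pert(blocks[name], inner[name], pre[name]))
        for a, d, b in cross:
            if owner[a] == name:
                bounds = pre[owner[b]]
                bounds[b] = max(bounds.get(b, 0), t[a] + d)
    return t


blocks = {"A": ["x1", "x2"], "B": ["y1", "y2"]}
inner = {"A": [("x1", 3, "x2")], "B": [("y1", 2, "y2")]}
cross = [("x2", 1, "y1")]
values = hierarchical_local_pert(blocks, inner, cross, ["A", "B"])
```

Each member block is visited exactly once in this sketch, which is the property behind the page-fault bound stated for acyclic systems.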

To find an optimal solution to an acyclic SLI system,
we first construct the super-dags $G_V$ for all the
blocks. This takes linear time and space and p page-
faults, where p is the total number of pages. We then
apply Hierarchical Local Pert on the outer-most block V
with Pre(V) = { x ≥ 0 | x is an outer-variable in V }. It is
easy to see that every page is fetched to the main
memory once. Therefore, we have the following:

Proposition 4.1 For a hierarchical SLI system with acyclic
super-graphs for all the blocks at all levels, an
optimal solution can be found in linear time and space,
causing 2p page-faults, where p is the
number of pages. •

4.2. Compaction for General Hierarchical SLI and SLIE
Systems

In this section, we discuss a hierarchical algorithm
for finding an optimal solution to a general hierarchical
SLIE system. This algorithm also applies to SLI systems
since SLI systems are also SLIE systems.

Definition Denote by the relation $R_1(x_1, x_2)$ (or $x_1 R_1 x_2$)
the condition where there is a constraint $x_1 + d_{12} \le x_2$,
and by $R_2(x_1, x_2)$ (or $x_1 R_2 x_2$) the condition where there
is a constraint $x_1 = x_2$. We say $x \to x'$ if there are
$x_1, x_2, \ldots, x_k$, $a_0, a_1, \ldots, a_k$ with $a_i \in \{1, 2\}$, such
that $R_{a_0}(x, x_1)$, $R_{a_1}(x_1, x_2)$, ..., $R_{a_k}(x_k, x')$, and there is
exactly one i such that $a_i = 1$.

The relation → just defined can actually be viewed
as the extension of the relation → defined for paged
dags in Section 3. The relation → on the variables has
the following interpretation: the value of a variable x
cannot be computed until all the variables x' such that
$x' \to x$ are computed. It is clear that → must be acyclic,
otherwise the SLIE system is unsolvable. (Recall that
we assume $d_{ij} > 0$ for an inequality constraint
$x_i + d_{ij} \le x_j$.)

Suppose V is a block of an SLIE system. Let Z(V)
be the set of the outer-variables of the member blocks
of V. We now define the relation $<_o$ on the outer-
variables in Z(V).

Definition Let V be a block of an SLIE system, and let
$x, x'$ be two outer-variables in Z(V). We say $x <_o x'$ (in
V) if and only if $x \to x'$ or there exist variables $x_1, \ldots, x_m$
in V such that $x \to x_1$, $x_1 \to x_2$, ..., $x_m \to x'$ and none of $x_i$,
$i = 1, \ldots, m$, is an outer-variable in Z(V). •

Note that the definitions of → and $<_o$ when restricted
to the SLI systems are consistent with the ones
defined in Section 3. For an SLIE system, if we disregard
the equality constraints, we get an SLI system
which we call the SLI subsystem of the original SLIE system.
When we restrict → to the SLI subsystem, we call it
→ with respect to the SLI subsystem, and likewise for
$<_o$.


From the definition of $<_o$, we see that $<_o$ must be
acyclic. For every block V, we can construct an outer-
dag on the outer-variables Z(V) with respect to $<_o$ as
before. Indeed for SLI systems, the method that
worked for paged dags in Section 3 can be extended in a
simple way to work for hierarchical SLI systems. However,
for SLIE systems, some complications arise due to
the presence of equality constraints. For instance, in
Example 3.1 of Section 3, if we add the following constraints:
$x_{10} = x_{20}$, $x_{12}$ = xn, and $x_{14} = x_{24}$, then
Xjc<£Xjt, and the outer-dag H becomes as shown in
Figure 4.1. We now cannot apply procedure DELETE to
determine a legal page sequence. For example,
although $x_{10}$ is exposed in H, we cannot compute it
immediately, for otherwise, since $x_{10} = x_{20}$, we would
have been able to compute $x_{20}$ without looking at block
$V_2$. The existence of an equality $x = x'$ implies that
when one of the two variables is computed, so is the
other. To take this fact into account, we need to consider
the equivalence classes on the outer-variables
formed with respect to the equalities.

Definition We partition the set of variables in an SLIE
system into equivalence classes with respect to $R_2$, the
relation defined by equalities. We write [x] for the
equivalence class x is in, and call [x] an
equality class. •

Clearly, $[x_1] = [x_2]$ if and only if $x_1 R_2^* x_2$.

Definition We say a block V touches [x] if
$[x] \cap Z(V) \ne \emptyset$. In this case, we let $[x]_V$ be
$[x] \cap Z(V)$, and call $[x]_V$ the restriction of [x] to V.
When a block V is specified, we sometimes write [x] or
[x] in V for $[x]_V$. •

We note that if for every variable $x_i$ in an equality
class [x], the values of all variables $x_j$ such that $x_j \to x_i$
are computed, then the value t[x] of all the variables in
[x] can be computed as follows:

(1) $p(x_i) := \max \{\, t(x_j) + d_{ji} \mid x_j + d_{ji} \le x_i$ is a constraint $\}$.

(2) $t[x] := \max \{\, p(x_i) \mid x_i \in [x] \,\}$.
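Steps (1) and (2) amount to two nested maxima, as the following small Python sketch shows. The function name `class_value`, the triple encoding `(xj, d, xi)` for a constraint $x_j + d \le x_i$, and the default bound 0 for a variable with no incoming inequality are assumptions made for the illustration.

```python
def class_value(eq_class, constraints, t):
    """Value shared by all variables of one equality class.

    eq_class: iterable of variable names in the class [x]
    constraints: triples (xj, d, xi) meaning xj + d <= xi
    t: already computed values of the variables xj
    """
    p = {}
    for xi in eq_class:
        # Step (1): best lower bound on xi from its inequality constraints.
        bounds = [t[xj] + d for (xj, d, xk) in constraints if xk == xi]
        p[xi] = max(bounds, default=0)
    # Step (2): the class takes the largest bound among its members.
    return max(p.values())


t = {"u": 3, "v": 1}
constraints = [("u", 2, "a"), ("v", 5, "b"), ("u", 1, "b")]
value = class_value({"a", "b"}, constraints, t)
```

Here p(a) = 5 and p(b) = 6, so both a and b receive the value 6.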

Now we extend the definition of $<_o$ to the equality
classes.

Definition Let $[x_1]$, $[x_2]$ be two equality classes that
touch a block V.

(1) If V is a bottom-level block, then write $[x_1] <_o [x_2]$ (in
V) if and only if there are $x \in [x_1]$, $x' \in [x_2]$ such that
$x <_o x'$ in V.

(2) If V is not a bottom-level block, then write $[x_1] <_o [x_2]$
(in V) if and only if there are $x \in [x_1]$, $x' \in [x_2]$ such that
either $R_1(x, x')$ or $[x_1]_v <_o [x_2]_v$, where v is a member
block of V that touches both $[x_1]$ and $[x_2]$. •

Lemma 4.1 Let V be a block in a hierarchical SLIE system.
Then

(1) the relation $<_o$ defines a partial order on the equality
classes that touch V.

(2) if $x, x'$ are two outer-variables in V, then $x <_o^* x'$
implies $[x] <_o^* [x']$.

Proof

Without loss of generality, we assume that there is
at least one outer-variable in each block so that every
block touches at least one equality class.

Part (1) is clear from the definition. Part (2) we
prove by induction on the level of V in the hierarchy.

For V at the bottom level, the assertion follows
immediately from the definition. Suppose V is not at
the bottom level. If $x, x'$ are in the same member block
v of V, then by induction, $x <_o^* x' \Rightarrow [x]_v <_o^* [x']_v$, and
therefore $[x] <_o^* [x']$.

If $x, x'$ are in different member blocks, say $x \in v_0$,
$x' \in v_m$, putting $x = x_{01}$, $x' = x_{m2}$, then $x <_o^* x'$ implies
that there are member blocks of V, $v_1, \ldots, v_{m-1}$, and
outer-variables $x_{i1}, x_{i2} \in v_i$, $i = 0, \ldots, m$, such that
$x_{i1} <_o^* x_{i2}$ and $x_{i2} R_1 x_{(i+1)1}$ for $i = 0, \ldots, m-1$. Now by
induction $x_{i1} <_o^* x_{i2} \Rightarrow [x_{i1}]_{v_i} <_o^* [x_{i2}]_{v_i}$, $i = 0, \ldots, m$,
$\Rightarrow [x_{i1}] <_o^* [x_{i2}]$. Also $x_{i2} R_1 x_{(i+1)1} \Rightarrow [x_{i2}] <_o [x_{(i+1)1}]$,
$i = 0, \ldots, m-1$. Therefore, $[x] <_o^* [x']$. •

By Lemma 4.1, for every block, we can form a dag
on the equality classes with respect to $<_o$. From
Lemma 4.1, we also see that for a block V and an equality
class [x] that touches V, if all equality classes [x']
such that $[x'] <_o [x]$ have been computed, then the
number $\max \{\, t(x_i) + d \mid x_i + d \le x,\ x_i \in V \,\}$ can be computed.
When V is the highest-level block, then this
value is exactly the value for all the variables in [x].
Therefore we can proceed from equality class to equality
class in an order that is consistent with the partial
order $<_o$.

As in Section 3, we need not construct the dag that
is defined with respect to the partial order $<_o$. As a
preparation stage, we need only compute the equality
classes. This can be done by applying breadth-first
search hierarchically, treating an equality $x = x'$ as an
edge $(x, x')$. We also compute in[x], which is the sum
of in(x') for $x' \in [x]$, where in(x') denotes the number of
$x''$ so that there is a constraint $x'' + d \le x'$. All these
computations can be done by examining each bottom-
level block once. To simplify the presentation, we also
assume that within each bottom-level block, there are
only inequality constraints binding variables in that
block. This can be achieved by forming equivalence
classes with respect to equality constraints within each
bottom-level block, choosing one representative for
each class, and substituting all the variables in a class
by the representing one.
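The preparation stage can be sketched in Python as follows. This is a flat, non-hierarchical illustration under assumptions: all variables and constraints are given at once, `equality_classes` and `class_in_degrees` are hypothetical names, an equality is a pair `(a, b)`, and an inequality $a + d \le b$ is a triple `(a, d, b)`.

```python
from collections import deque


def equality_classes(variables, equalities):
    """Map each variable to the frozenset of variables equal to it, using BFS."""
    adj = {x: [] for x in variables}
    for a, b in equalities:          # an equality a = b is treated as an edge
        adj[a].append(b)
        adj[b].append(a)
    cls = {}
    for start in variables:
        if start in cls:
            continue
        seen, queue = {start}, deque([start])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        members = frozenset(seen)
        for x in members:
            cls[x] = members
    return cls


def class_in_degrees(cls, inequalities):
    """in[x]: number of inequality constraints entering any member of the class."""
    total = {c: 0 for c in set(cls.values())}
    for a, d, b in inequalities:
        total[cls[b]] += 1
    return total


cls = equality_classes(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
indeg = class_in_degrees(cls, [("d", 1, "a"), ("d", 2, "c")])
```

Both passes touch each variable and each constraint a constant number of times, which is consistent with examining each bottom-level block once.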

The main procedure LOCAL EQ-PERT is described
below. It takes two arguments V and Y(V), where V is a
block. Y(V) is a set of newly computed variables if V is
at the bottom level, and a set of newly computed equality
classes otherwise. It will compute as many variables
and equality classes in V as possible, starting from the
given Y(V). In the procedure as well as in the algorithm,
t(x) will denote the current lower bound on a
variable x, and t[x] will denote the current lower bound on
an equality class [x]. The final values of t(x) and t[x]
will be the values of x and [x] respectively. Also, $V_0$ will
denote the highest-level block.

Procedure LOCAL EQ-PERT (Y(V), V)

begin

  if V is at bottom level then while Y(V) is not empty do

  begin

    for x ∈ Y(V) do begin

      for a constraint $x + d \le x_i$ do begin

        t($x_i$) := max( t($x_i$), t(x) + d );
        in($x_i$) := in($x_i$) - 1;

        if in($x_i$) = 0 then if $x_i$ is not an outer-
        variable then put $x_i$ in Y(V) else do begin
          t[$x_i$] := max( t[$x_i$], t($x_i$) );
          in[$x_i$] := in[$x_i$] - 1;
          if in[$x_i$] = 0 then put [$x_i$] in Y($V_0$);
        end

      end;

    end

  end

  if V is not at bottom level then for each member block
  v that touches at least one equality class in Y(V) do

  begin

    Y(v) := { $[x]_v$ | [x] ∈ Y(V) };

    LOCAL EQ-PERT (Y(v), v);

  end

end

Now we describe the algorithm.

Algorithm COMPUTE SLIE
begin

  t(x) := 0 for all variables x;

  t[x] := 0 for all equality classes [x];

  for each bottom-level block V do begin

    for variable x in V such that in(x) = 0 do begin

      if x is not an outer-variable then put x in
      Y(V);

      if x is an outer-variable and in[x] = 0 then
      put [x] in Y($V_0$);

    end

    LOCAL EQ-PERT (Y(V), V);

  end

  do until Y($V_0$) is empty begin
    LOCAL EQ-PERT (Y($V_0$), $V_0$);

  end

end

Let p be the number of pages in a hierarchical SLIE
system. The preparation stage for computing equality
classes takes p page-faults. Other than that, algorithm
COMPUTE SLIE examines each bottom-level block no
more than m times, where m is a bound on the number
of outer-variables in a bottom-level block. We therefore
have finally,

Theorem 4.1 Algorithm COMPUTE SLIE runs in time
O(N), where N is the total number of constraints, and
causes a number of page faults no greater than (m+1)p,
where p is the number of pages, and m is the maximum
number of outer-variables a page (that is, a bottom
level block) can have. •

ABSTRACT

Two sets of conditions are derived that make one-
dimensional bilateral arrays of combinational cells
testable for single faulty cells. The test sequences are
preset and, in the worst case, grow quadratically with
the size of the array. Conditions for testability in linear
time are also derived. The basic cell can operate at the
bit or at the word level. An implementation of FIR
filters using (systolic) one-dimensional bilateral arrays
of cells, which can be considered combinational at the
word level, is presented as an example.


1. Introduction

The use of iterative arrays of identical cells in
current VLSI technology is becoming more frequent due
to their many advantages, like ease of design, fabrication
and testing. Moreover, many problems are
efficiently solved with the use of "systolic arrays", which
are highly iterative structures operating synchronously.
An important problem associated with these structures
is fault detection, that is, derivation of test input
sequences to the array, such that the output sequences
of the normal and any faulty array (under some fault
assumptions) are different. In this paper we give some
testability conditions for a special class of arrays
(defined below) that improve upon the condition
reported in [3]. Derivation of the test input sequences
is also described.

2. Assumptions, Definitions and Notation

Figure 1a shows a bilateral array of combinational
cells. The basic cell is shown in Figure 1b. At each time
unit it produces left and right outputs, depending on its
left, right and vertical inputs.

Let R be the set of right-moving signals, L the set
of left-moving signals and Z the set of vertical cell
inputs. Let $g_R: R \times Z \times L \to R$ be the right-moving signal
mapping, and $g_L: R \times Z \times L \to L$ be the left-moving signal
mapping.

A fault in a particular cell alters $g_R$, $g_L$, or both for
one or more arguments (r, s, l). However, we assume
that the cell remains combinational.

We assume initially that to test a cell completely,
we must apply all input combinations in $R \times Z \times L$ to that
cell. This assumption makes testing of the cells
independent of how they are realized and independent
of the fault model for permanent faults [1]. We shall
examine later the case when only a subset of $R \times Z \times L$
suffices to test the basic cell. We further assume that
to test the array completely (for single faulty cells), we
must test completely every cell in the array.

The left, vertical and right inputs of cell j at time t
are denoted as $r^j(t)$, $s^j(t)$, $l^j(t)$ respectively. Hence,
cell j at time t+1 will produce left and right outputs
$l^{j-1}(t+1) = g_L(r^j(t), s^j(t), l^j(t))$ and
$r^{j+1}(t+1) = g_R(r^j(t), s^j(t), l^j(t))$.
Figure 2 shows the space-time transformation [2]
of the array in Fig. 1a. Each row represents the array
at each time unit. This makes the operation of the
array easier to visualize. Note that this transformation
maps the one-dimensional synchronous bilateral array
into a two-dimensional asynchronous unilateral array.
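The time-unrolled view can be made concrete with a short Python sketch that simulates the array one row (one time unit) at a time. It assumes the convention that the right output of cell j at one time unit becomes the left input of cell j+1 at the next, and symmetrically for left outputs; the function name `simulate`, its parameters, and the example cell functions are hypothetical.

```python
def simulate(g_R, g_L, r, l, vertical, left_in, right_in):
    """Unroll a bilateral array in time.

    r, l: initial left and right inputs of the p cells (lists of length p)
    vertical: one list of p vertical inputs per time unit
    left_in, right_in: boundary inputs fed to the leftmost and rightmost
        cell at each later time unit
    Returns the observable (leftmost left output, rightmost right output)
    pair for every time unit.
    """
    r, l = list(r), list(l)
    p = len(r)
    observed = []
    for t, s in enumerate(vertical):
        out_r = [g_R(r[j], s[j], l[j]) for j in range(p)]
        out_l = [g_L(r[j], s[j], l[j]) for j in range(p)]
        observed.append((out_l[0], out_r[-1]))
        if t + 1 < len(vertical):
            r = [left_in[t]] + out_r[:-1]   # right-moving signals shift right
            l = out_l[1:] + [right_in[t]]   # left-moving signals shift left
    return observed


# Example cell on bits: passes r through unaltered, and l' = l XOR (r AND s).
g_R = lambda r, s, l: r
g_L = lambda r, s, l: l ^ (r & s)
rows = simulate(g_R, g_L, [1, 0], [0, 0], [[1, 1], [1, 1]], [1], [0])
```

Each iteration of the loop corresponds to one row of the space-time picture, and only the two boundary outputs are visible from outside the array.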

If $L_0$ is a subset of L, define $g_R(r, s, L_0) = R_0$,
where $R_0$ is the set $\{\, g_R(r, s, l) \mid l \in L_0 \,\}$. If $R_0$ contains
just one element r', we write $g_R(r, s, L_0) = r'$ (instead
of $\{r'\}$). Similarly for $g_R(R_0, s, l)$, $g_R(r, Z_0, l)$, and
similarly for $g_L$.

If $r \in R$, define $R_r = R - \{r\}$. Similarly $L_l = L - \{l\}$.

An input or output labeled $r/R_0$, where $R_0 \subset R$ and r
does not belong to $R_0$, means normal input or output r,
and faulty input or output some member of $R_0$. If $R_0$
contains just one element r', we write r/r'. Sometimes,
when no confusion arises, $r/R_0$ will also be
called a fault. For simplicity $r/R_r$ will sometimes be
written as r/•.

We also define $g_R(r_1/r_2, s, l) = r_1'/r_2'$ if
$g_R(r_1, s, l) = r_1'$ and $g_R(r_2, s, l) = r_2'$ and $r_1 \ne r_2$,
$r_1' \ne r_2'$. We define similarly $g_R(r, s, l_1/l_2) = r_1'/r_2'$,
and similarly for $g_L$. Another definition is
$g_R(r/R_0, s, l) = r'/R_0'$ if $g_R(r, s, l) = r'$ and
$g_R(R_0, s, l) \subset R_0'$. Similarly for $g_R(r, s, l/L_0)$ and for
$g_L$. (Note that $g_R(r/R_0, s, l)$ is not uniquely defined,
since $R_0'$ may be any superset of $g_R(R_0, s, l)$ not
containing r'.) If, according to the above, the input of a
cell is a fault $r/R_0$ and its output is $r'/R_0'$, we say that
this cell propagates the fault $r/R_0$. Notice that according
to our convention r' does not belong to $R_0'$, so we
can distinguish between the normal and the faulty output.

In the sequel, p denotes the total number of cells
in the array.

3. Some General Testability Conditions

Let V be the following set of conditions:

C1: for every $r \in R$ there exist $r' \in R$, $s \in Z$ such that
$g_R(r', s, L) = r$.

C2: for every $l \in L$ there exist $l' \in L$, $s \in Z$ such that
$g_L(R, s, l') = l$.

O1: for every $r_1, r_2 \in R$ with $r_1 \ne r_2$, there exist
$l \in L$, $s \in Z$ such that $g_R(r_1, s, l) \ne g_R(r_2, s, l)$.

O2: for every $l_1, l_2 \in L$ with $l_1 \ne l_2$, there exist
$r \in R$, $s \in Z$ such that $g_L(r, s, l_1) \ne g_L(r, s, l_2)$.

Theorem 1: Any bilateral array of combinational
cells that satisfies conditions V is testable for single
faulty cells.

The proof is similar in spirit to the proof of the
next theorem, and it is actually simpler, so it is
omitted. We only mention that, under conditions V, to
test the entire array completely we need
$p(p+1) \cdot \max(|R|, |L|) \cdot |R||Z||L|$ tests.

Let W be the following set of conditions (see Figs. 3
and 4):

C1: for every $r \in R$ there exist $r' \in R$, $s \in Z$ such that
$g_R(r'/•, s, L) = r/•$.

C2: for every $l \in L$ there exist $l' \in L$, $s \in Z$, $r' \in R$ such
that $g_L(r', s, l') = l$ and $g_R(r'/•, s, l') = r/•$.

O1: for every $r \in R$ there exist $s \in Z$ such that
$g_R(r/•, s, L) = r'/•$.

O2: for every $l_1, l_2 \in L$ with $l_1 \ne l_2$, there exist
$r \in R$, $s \in Z$ such that $g_L(r, s, l_1) \ne g_L(r, s, l_2)$.

Conditions W hold if 1) the basic cell just transmits
unaltered the right-moving signal, 2) $g_L(R, Z, L) = L$, 3)
for any two different right inputs there exist r, s that
produce two different left outputs. This is a quite
reasonable set of assumptions.

Theorem 2: Any bilateral array of combinational
cells that satisfies conditions W is testable for single
faulty cells.

Proof. Assume we want to test cell j for inputs
$(r_0, s_0, l_0)$. These test inputs will be applied at time
$t = 2p-j$ if the test begins at time $t = 1$. Hence
$r_0 = r^j(2p-j)$, $l_0 = l^j(2p-j)$, $s_0 = s^j(2p-j)$ (see Fig. 5,
shaded cell).

First we must make the left input of cell j at time
$2p-j$ be $r_0 = r^j(2p-j)$. Condition C1 guarantees the
existence of $r^{j-1}(2p-j-1)$, $s^{j-1}(2p-j-1)$ such that
$r^j(2p-j) = g_R(r^{j-1}(2p-j-1), s^{j-1}(2p-j-1), L)$; hence
it suffices to apply $r^{j-1}(2p-j-1)$, $s^{j-1}(2p-j-1)$ as left
and vertical inputs to cell $j-1$ (at time $2p-j-1$).
Inductively, apply $r^{j-2}(2p-j-2)$, $s^{j-2}(2p-j-2)$ to cell
$j-2$, etc., until we reach the leftmost cell. The tricky
part is to apply right input $l_0 = l^j(2p-j)$ to cell j.
Condition C2 guarantees the existence of $r^{j+1}(2p-j-1)$,
$l^{j+1}(2p-j-1)$, $s^{j+1}(2p-j-1)$ such that
$l^j(2p-j) = g_L(r^{j+1}(2p-j-1), s^{j+1}(2p-j-1), l^{j+1}(2p-j-1))$
and
$g_R(r^{j+1}(2p-j-1)/•, s^{j+1}(2p-j-1), l^{j+1}(2p-j-1)) = r^{j+2}(2p-j)/•$.
Hence it suffices to apply input
$(r^{j+1}(2p-j-1), s^{j+1}(2p-j-1), l^{j+1}(2p-j-1))$ to cell $j+1$
(at time $2p-j-1$). Left input $r^{j+1}(2p-j-1)$ and right
input $l^{j+1}(2p-j-1)$ can be applied (recursively) in the
same way we applied inputs $r_0$ and $l_0$ to cell j (using
C1). This solves the controllability problem.

We have not yet used the strong part of condition
C2, namely the fact that $g_R(r'/•, s, l') = r/•$. The
usefulness of this will become apparent in the sequel.

Assume that the normal right and left outputs of
cell j (on input $(r_0, s_0, l_0)$) are r and l respectively;
assume that we test for the error l/l' in the left output.
We can simultaneously test for all errors r/• in the
right output. Propagation of the error l/l' to the leftmost
output is done using O2 and C1.

We have not yet discussed the "southeast" portion
of Fig. 5, that is, the portion below the right-to-left diagonal
that passes through the shaded cell. First we have
to propagate the fault r/• ( = $r^{j+1}(2p-j+1)$/• ) to the
rightmost output. But cell j may fail to function
correctly at any previous time, so for instance (see Fig.
5) cell j on input $r^j(2p-2m+j)$ (for some m in
$\{j+1, j+2, \ldots, p\}$) may not output
$r^{j+1}(2p-2m+j+1)$, so cell m may not output
$l^{m-1}(2p-m+1)$, hence cell j may not receive $l_0$ as right
input, and due to a fault, it may output the expected
outputs l, r. Under this worst case scenario the two
faults will be masked and we will get the expected
observable outputs $l^0(2p)$ and $r^{p+1}(3p-2j+1)$ (assuming
$j \ge (p+1)/2$). This is avoided as follows:

First, to propagate the fault r/• to the rightmost
output, using condition O1 we find $s^{j+1}(2p-j+1)$ such
that cell $j+1$ on input $r^{j+1}(2p-j+1)$/• outputs
$r^{j+2}(2p-j+2)$/•; inductively we propagate this fault to
the rightmost output ($r^{p+1}(3p-2j+1)$/•). Similarly, we
propagate the fault $r^m(2p-m)$/• for
$m = j+1, j+2, \ldots, p$. Notice that the potential previous
fault $r^{j+1}(2p-2m+j+1)$/• of cell j has been
"automatically" propagated to cell m as $r^m(2p-m)$/•
by the strong part of condition C2 "when" we were solving
the controllability problem. So, if cell j outputs
something different from $r^{j+1}(2p-2m+j+1)$, we shall
detect it at the rightmost output by getting something
different from the expected $r^{p+1}(3p-2m+1)$. This
solves the observability problem.

The above procedure is repeated for every l' in $L_l$;
then we have tested cell j for input $(r_0, s_0, l_0)$. This is
repeated for every (r, s, l) in $R \times Z \times L$; then we have
tested cell j completely. •

The testing time shown in Fig. 6 is 2p. Hence to
test cell j for input $(r_0, s_0, l_0)$ we need $2p|L|$ tests,
hence to test completely cell j we need
$2p|L|^2|R||Z|$ tests, and to test completely the
array we need $2p^2|L|^2|R||Z|$ tests. Note that if
some cell in the space-time transformation is used at
time t, it is never used at time t+1, hence the obvious
pipelining reduces the testing time to one-half of the
above number of tests.

4. One-step Testability

Parthasarathy and Reddy [4] introduced the notion
of one-step testability for unilateral arrays. We extend
this notion to bilateral arrays as follows.

Definition. A cell in a bilateral array of combinational
cells is one-step testable for input $(r, s, l)$ if the
number of time units needed to test this cell for input
$(r, s, l)$ is independent of $|R|$, $|L|$.

Definition. A cell in a bilateral array of combinational
cells is one-step testable if it is one-step testable
for all inputs $(r, s, l)$ in $R \times Z \times L$.

Definition. A bilateral array of combinational cells
is one-step testable if all its cells are one-step testable.

The notion of one-step testability is important for
the following reason: if an array is one-step testable, the
time needed to test it is greatly reduced since, if the
expected output of a cell under test is, say, $l$, it is not
necessary to apply different test inputs for each fault
$l/\bar l$.

The following conditions are useful for one-step
testability (see Fig. 6):

OST1: for every $r \in R$ there exist $l \in L$, $s \in Z$ such that
$g_R(r/\cdot, s, l) = r/\cdot$.

OST2: for every $l \in L$ there exist $r \in R$, $s \in Z$ such that
$g_L(r, s, l/\cdot) = l/\cdot$.

Let OST be conditions OST1 and OST2 together. Let
V-OST (W-OST) be conditions V (W) and OST together.

If W-OST hold, instead of testing for the fault $l/\bar l$ for
each $\bar l$ in $L$ as in the proof of theorem 2, we can test for
the fault $l/\cdot$. Thus, to test cell $j$ for input $(r_0, s_0, l_0)$,
we only need $2p$ time units. Therefore, if we want to
test all cells for inputs in a subset $I$ of $R \times Z \times L$, we need
$2p^2|I|$ time units. If V-OST hold, similarly as above, by
testing for all output faults simultaneously (on a given
input), we need $p(p+1)|I|$ tests if we want to test all
cells for a subset $I$ of $R \times Z \times L$.

5. Conditions for Testability in Linear Time

Let L be the following set of conditions (see Fig. 7):

OC1: For every $r_1, r_2 \in R$ with $r_1 \ne r_2$ and for every $l \in L$
there exists $l' \in L$ and $s \in Z$ such that
$g_R(r_1, s, l') \ne g_R(r_2, s, l')$ and $g_L(r_1, s, l') = l$.

OC2: For every $l_1, l_2 \in L$ with $l_1 \ne l_2$ and for every $r \in R$
there exists $r' \in R$ and $s \in Z$ such that
$g_L(r', s, l_1) \ne g_L(r', s, l_2)$ and $g_R(r', s, l_1) = r$.

Theorem 3: Condition C1 of V and L is a sufficient
set of conditions for testability of the entire array in a
number of time units proportional to the length of the
array.

Proof: We shall give a constructive proof. To keep
things simple, we shall work on an example; the
generalization is straightforward. Consider the space-time
transformation of an array of 6 cells as in Fig. 8.
Assume  that  each  cell  must  undergo  2  tests,  that  is  it 
must  be  tested  for  inputs  (r,.  Z,)  (twl,2)  for  faults 
r  ,/f<t  i\/\  (i.e.  are  the  normal  outputs  and 

f%.  X  the  faulty  outputs).  In  Fig  6  cells  under  test 
are  depicted  as  full  squares;  to  the  inputs  and  outputs 
of  these  cells  are  ‘  given  ’. 

Let's group the cells of the space-time transformation
diagonally, forming different groups on the left and
right of the cells under test, as in Fig. 8. Call these
groups left and right diagonal groups. Note that each
left diagonal group of $c$ cells has one left input, one
right output, $c$ right inputs and $c$ left outputs. Its right
inputs will typically be faults that have to be propagated
to its left outputs. Fig. 9 shows a left diagonal
group containing cells $k$ through $m$, at time units $t$
through $t+m-k$. Assume that the faults
♦<)  for  .  m-k  are  given  and 

have  to  be  propagate i  to  its  left  outputs;  also  output 
+m  ♦!)  must  be  generated.  We  shall  show  bow 
to  find  its  left  input  r*(t)  that  does  that,  tiling  condi¬ 
tion  OCl  we  can  find  input  rw(f  4-m-fc)  that  propagates 
the  fault  I* (f  )/Tm(t  )  and  generates  the 

output  r**l(f  +m  -k  +1).  Now  rm(f4k-ws)  has  to  be 
generated, so inductively we can find $r^k(t)$. Hence each
left diagonal group can propagate to its left outputs any
faults on its right inputs and, at the same time, it can
generate any right output. Entirely symmetrical things
hold for each right diagonal group.

We  apply  the  construction  stated  above,  starting 
from  the  left  diagonal  group  at  the  top  of  Fig  8  (which 
contains  just  one  cell);  that  is.  we  find  its  left  input  so 
that  the  fault  r‘|/rt  ie  propagated,  and  output  eB  is  gen¬ 
erated  Then,  for  the  next  left  diagonal  group  (second 
from the top), we know the faults that it has to propagate.
If we were testing the leftmost cell for one
more input, we would also know the output that has to
be generated by this diagonal group; since this is not
the case here, we can choose it arbitrarily. Therefore,
we can apply that construction again to find its left
input that does this. We proceed this way, from top to
bottom. (Inductively, each diagonal group will have to
propagate some faults and generate an output; as we
proved above, there exists an input that does this.) We
repeat the same procedure for the right diagonal
groups, proceeding again from top to bottom. Obviously
it does not hurt if some of the left inputs of a right
diagonal group are not faults that have to be propagated, as
for the first, third etc. groups in Fig. 8.

For the first $p-2$ left diagonal groups their left
inputs are "inside" the array, so we have the additional
problem of how to generate them. Condition C1 of V
takes care of this (as in theorem 2).

If each cell requires $T$ tests, $(T+1)p$ is the
number of inputs that have to be applied to the left
boundary of the array (see Fig. 8); additionally, $p-1$
time units are required for the propagation to the right
boundary of the faults in the leftmost cell. Hence the
entire array can be tested in $(T+1)p + p - 1$ time
units. Again, the obvious pipelining saves about half of
these time units. ∎
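
A small sketch of the bound just derived, using the example from the proof (6 cells, 2 tests per cell). The function name is hypothetical; the expression is the one stated in the text.

```python
def linear_test_time(tests_per_cell, p):
    """Time units to test a p-cell array when each cell needs T tests.

    (T + 1) * p inputs are applied at the left boundary, and p - 1 further
    time units let faults from the leftmost cell reach the right boundary.
    """
    return (tests_per_cell + 1) * p + (p - 1)


# The example used in the proof: an array of 6 cells, 2 tests per cell.
print(linear_test_time(tests_per_cell=2, p=6))
```

For a fixed number of tests per cell the result grows linearly in $p$, which is the point of the theorem.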

6. Conditions for One-step Testability in Linear Time

Let us consider now one-step testable arrays. It
turns out that it suffices to replace the fault $r_1/r_2$
(resp. $l_1/l_2$) in conditions L by the fault $r/\cdot$ (resp.
$l/\cdot$). This way we obtain the following set of conditions
L-OST.

OC1: For every $r \in R$ and for every $l \in L$ there exists
$l' \in L$ and $s \in Z$ such that
$g_R(r/\cdot, s, l') = g_R(r, s, l')/\cdot$ and $g_L(r, s, l') = l$.

OC2: For every $l \in L$ and for every $r \in R$ there exists
$r' \in R$ and $s \in Z$ such that
$g_L(r', s, l/\cdot) = g_L(r', s, l)/\cdot$ and $g_R(r', s, l) = r$.

Theorem 4: Condition C1 of V and L-OST is a
sufficient set of conditions for one-step testability of the
entire array in a number of time units proportional to
the length of the array.

Proof: Analogous to the proof of theorem 3. If
each cell is tested for a subset $I$ of its inputs, the
number of time units required is $(|I| + 1)p + p - 1$. ∎

Remark. All the above results are easily generalized
for the case when $g_R$, $g_L$ are not identical for every
cell, that is, we have $g_R^i$, $g_L^i$ for the $i$-th cell; it suffices to
replace the conditions for $g_R$, $g_L$ by conditions for $g_R^i$,
$g_L^i$ for every $i$.

7. Application

Figure 10 shows the basic cell of a two-way pipeline
systolic array used for FIR filtering [5], [6]. For this cell
we have $|Z| = 1$ (no $s$-inputs), $g_R(r, l) = r$,
$g_L(r, l) = l + a r$. This array can be considered as a
bilateral array of combinational cells at the word level
(the basic time unit is the time required to produce the
outputs). It is easy to see that conditions W-OST are
satisfied. (Here we have the case when $g_R$, $g_L$ depend
on the cell.) Therefore, if a subset $I$ of $R \times L$ suffices to
test the basic cell, $2p^2|I|$ tests suffice to test the
array. Also, if we ignore overflow problems, assuming
for all $r$, $l_1 + a r \ne l_2 + a r$ for $l_1 \ne l_2$, conditions L-OST
hold, hence for this case $O(p|I|)$ tests suffice to test
the array. Note that the inequality above does not have
to hold for all $r$, $l_1$, $l_2$: obviously, it suffices that it hold for
the signals that appear as inputs or outputs in the test
described in the proof of theorem 3.
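
The cell functions above can be written out directly. The sketch below is only an illustration with hypothetical function names; it uses Python's unbounded integers, which matches the assumption in the text that overflow is ignored.

```python
def g_right(r, l):
    """Right output of the FIR cell: the r signal passes through unchanged."""
    return r


def g_left(r, l, a):
    """Left output of the FIR cell with coefficient a: l + a*r."""
    return l + a * r


def left_fault_propagates(r, l_good, l_faulty, a):
    """True if a wrong value on the l input is visible at the left output."""
    return g_left(r, l_good, a) != g_left(r, l_faulty, a)


print(g_right(3, 5), g_left(3, 5, a=2), left_fault_propagates(3, 5, 6, a=2))
```

Without overflow, two different values of $l$ always give different left outputs for the same $r$, which is the inequality the text relies on.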
Fig. 5. The test described in theorem 2 (vertical inputs
are not shown for simplicity).
Fig. 9. A left diagonal group.
Fig. 10. The basic cell of a two-way pipeline
systolic array for FIR filtering.

Testing an array that satisfies conditions L.
Cells under test are depicted as full squares.
Abstract:

We present a method of creating circuits which are
easily tested for all single stuck-at faults using a constant
number of test vectors. The method is the combination
of a number of techniques. It uses special
"controllable" logic gates and latches. It requires that
combinational logic subcircuits be bipartite, which is
achieved by transformation if necessary. The method
was previously presented for nMOS combinational logic.
In this paper, we extend this method both to CMOS and
to sequential circuits. We also discuss alternate methods
of achieving bipartiteness during testing.


1.  Introduction 

In [La83], we presented a new approach to the production
testing of VLSI circuits. This approach gives
100% single stuck-at fault coverage of circuits using a
constant number of test vectors. It also covers many
multiple faults. Generating test vectors is very fast; in
fact, it does not require any searching, only a one-pass
analysis of the circuit. Our method has tremendous
advantages over traditional methods in getting
guaranteed high fault coverage without the high costs of
searching for good test vectors and applying large sets of
test vectors. One of its great advantages is that the set
of test vectors is small enough to be stored on-chip, giving
deterministic self-testing. The approach does have
penalties, primarily in circuit area but also in speed.
However, we believe the advantages will outweigh the costs
in many situations.

Our approach is the combination of three techniques
which, in fact, could be used separately. The first
is the use of circuits with the property of being bipartite;
the second is the use of special controllable logic
elements; the third is the use of small amounts of logic to
observe the values of internal nodes. Originally, the
approach was presented for nMOS technology and primarily
combinational logic. The purpose of this second
paper is to give extensions of the method to sequential
logic and to CMOS. Also, we will explore variations of the
method which address tradeoffs in fault coverage, area,
speed, and fault tolerance. We are particularly concerned
with providing alternatives when the area penalty
of our basic technique is too costly.
2. The Approach

Our approach is based on the ease with which any
wire in a bipartite circuit can be controlled to the value
0 or 1. A bipartite circuit is a combinational logic circuit
whose gates can be colored black and white such that no
two gates of the same color are connected. Given such a
coloring of the gates, the wires can also be colored so
that each wire inherits the color of the gate to which it is
input. Circuit output wires inherit the opposite color of
the gate from which they are output.

Call a logic gate inverting if it has output 1 when
presented with all 0's as input and 0 when presented with
all 1's as input. Inverter, NAND, and NOR gates are
inverting. If a bipartite circuit consists of inverting logic
gates, then every wire can be forced to the values 0 and
1 using just two input vectors. (This was observed for
NOR-equivalent circuits by Akers [Ak7g].) The property
is central to our approach; we call it the parity principle.

Parity Principle: If we set all the black input wires
of an inverting logic bipartite circuit to the value $V$ and
all the white input wires of the circuit to the value $\bar V$,
then all black (respectively white) wires take on the
value $V$ (respectively $\bar V$).
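
The parity principle can be checked on a small example. The sketch below is a minimal simulation under stated assumptions: the three-gate NOR circuit, its coloring, and all names are hypothetical and chosen only to illustrate the wire-coloring rule described above.

```python
def nor(*inputs):
    """NOR of any number of inputs (an inverting gate)."""
    return int(not any(inputs))


# Hypothetical bipartite circuit: gate name -> (gate color, input wire names).
# Each gate's output wire carries the gate's name. Listed in topological order.
GATES = {
    "g1": ("black", ["a", "b"]),
    "g2": ("white", ["g1", "c"]),
    "g3": ("black", ["g2", "d"]),
}
OPPOSITE = {"black": "white", "white": "black"}


def wire_colors(gates):
    """A wire inherits the color of the gate it feeds; a circuit output
    wire gets the opposite color of the gate that drives it."""
    colors = {}
    for _, (color, inputs) in gates.items():
        for wire in inputs:
            colors[wire] = color
    for name, (color, _) in gates.items():
        colors.setdefault(name, OPPOSITE[color])
    return colors


def parity_vector(gates, v):
    """Primary inputs: black wires get v, white wires get its complement."""
    colors = wire_colors(gates)
    return {w: (v if c == "black" else 1 - v)
            for w, c in colors.items() if w not in gates}


def simulate(gates, primary_inputs):
    values = dict(primary_inputs)
    for name, (_, inputs) in gates.items():
        values[name] = nor(*(values[w] for w in inputs))
    return values


for v in (0, 1):
    print(v, simulate(GATES, parity_vector(GATES, v)))
```

In both runs every black wire carries `v` and every white wire carries its complement, so the two vectors together drive every wire to both 0 and 1.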

We use the stuck-at fault model [Br76] in which each
input to a logic gate and each output from a logic gate
may be independently stuck-at 0 or stuck-at 1. This
model includes stuck on or stuck off faults for a MOS
transistor since these are equivalent to stuck-at 0 or
stuck-at 1 faults on the gate (control input) of the
transistor.

To use the parity principle in a test strategy, two
test vectors are applied: one with all white input wires
set to 0 and all black input wires set to 1, and the other
with these values reversed. These will excite any stuck-at
fault at the output of a logic gate. However, note that
in a fault-free circuit, all inputs to an individual logic
gate will be equal. Thus not all stuck-at faults at the
inputs of logic gates will be tested. For example, given a
(two-input) NOR gate, to detect an individual input
stuck-at 0 requires that value 1 be applied to the stuck-at
input and 0 be applied to the other input. Therefore,
using the parity principle to test bipartite inverting logic
circuits allows one to check that each logic gate output
can take on the values 0 and 1 but does not catch each
stuck-at fault in a transistor or at an input.
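
The NOR example can be made concrete with a short fault-injection sketch. The function names and the way a fault is passed in are hypothetical; the behavior is ordinary two-input NOR logic.

```python
def nor2(x, y, stuck=None):
    """Two-input NOR. `stuck` optionally forces one input to a fixed value,
    e.g. ("x", 0) models input x stuck-at 0."""
    if stuck is not None:
        name, value = stuck
        if name == "x":
            x = value
        elif name == "y":
            y = value
    return int(not (x or y))


def detects(x, y, stuck):
    """True if applying (x, y) makes the faulty output differ from the good one."""
    return nor2(x, y) != nor2(x, y, stuck)


# Equal inputs, as produced by the two parity vectors, miss the fault:
print(detects(1, 1, ("x", 0)), detects(0, 0, ("x", 0)))
# Only x = 1, y = 0 exposes input x stuck-at 0:
print(detects(1, 0, ("x", 0)))
```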

To catch all stuck-at faults in a bipartite inverting
logic circuit, we replace each logic gate with a special
gate. This gate uses extra global control inputs to catch
all stuck-at faults for the gate. In [La83], an nMOS two-input
NOR gate was presented for use in circuits containing
only NOR gates and inverters. An nMOS NAND for
NAND/inverter circuits is similar, and NOR and NAND
gates (requiring more control variables) for use in mixed
NAND/NOR circuits are extensions of these. In the next
section, we present a controllable NOR for CMOS.

Given a circuit of controllable gates, the number of
test vectors increases as a function of the number of global
control inputs used. For a circuit consisting only of
controllable two-input NOR gates (inverters are created
by using a NOR gate with both inputs the same), the
number of test vectors increases from two to five. The
purpose of the five test vectors will become clear in the
next section when the CMOS NOR is presented in detail.
Because combinational circuits are feedback-free, any
faulty circuit will have at least one gate whose inputs
come from fault-free gates but whose output is
incorrect. This is detected by observing the output of
every gate.

Normally, observing the output of every gate would
be an impossible task for an LSI or VLSI circuit. There are
far too few pins available, and mechanical probing is
difficult. A scanning electron microscope can be used
for such observation [Ki82], but its use is not practical
for production testing of a large number of chips and
prohibits field testing. However, in our approach there
are very few events, i.e. combinations of values of gate
outputs, that we wish to observe. The basic technique
using a bipartite circuit with controllable NOR gates
requires the observation of three events: all white gates
output 0 and all black gates output 1; all white gates
output 1 and all black gates output 0; all gates output 1.
Handling special input pads and sequential logic will
increase the number of events by two.

Because there are so few events using our approach,
we can add extra circuitry to observe these events.
Each event is observed by using a large fan-in AND to
observe all outputs which should be 1 and a large fan-in
OR to observe all outputs which should be 0. By physically
distributing these large fan-in gates, we can keep
the increase in circuit area small. Note that the number
of such large fan-in gates needed is proportional to the
number of distinct events to be observed. Most test
methods require the observation of a large number of
events. Thus, although the observation circuitry could
be added to any circuit, the amount of circuitry needed
would be prohibitive for most test methods. This method
of observing internal values can be used with other test
methods requiring the observation of only a few events
(e.g. [Ha74, Sa74]).

3. The Approach for CMOS

The CMOS controllable NOR gate is a modification of
the standard ratioless pull up-pull down CMOS NOR [Ho83]
shown in Figure 1a. Figure 1b shows the new gate with
additional inputs $C_1$ and $C_2$. This gate is used in circuits
consisting only of inverters and NORs; the inverter is
created by using a NOR with both inputs the same.
Inputs $C_1$ and $C_2$ are global to the circuit: every gate in
the circuit contains them. For normal operation, $C_1$ and
$C_2$ have the value 1. Table 1 shows the values to be used
to excite each possible stuck-at fault in the gate. Note
that an n-type (respectively p-type) transistor stuck on
is equivalent to its input stuck-at 1 (respectively stuck-at
0); an n-type (respectively p-type) transistor stuck
off is equivalent to its input stuck-at 0 (respectively
stuck-at 1).

In Table 1, note that whenever one of $C_1$ and $C_2$
takes on the value 0 and the other takes on the value 1, a
fault-free gate is inverting with respect to normal inputs
$x$ and $y$. Thus, for these values of $C_1$ and $C_2$, the parity
principle holds and gates can be tested simultaneously,
white and black gates being tested for opposite faults.
To test stuck-at 1 faults at $C_1$ or $C_2$, both are set to 0. In
this case, the normal inputs and outputs of all gates
should be 1; again, all gates can be tested simultaneously.

Table 1 shows that each single transistor fault in the
CMOS gate results in the gate output being either electrically
isolated, denoted floating, or on a path from VDD to
VSS, denoted short. In nMOS, because the logic is
ratioed, a short provides a valid logic 0 [Me80]. However,
in CMOS, this may not be the case. Also, in nMOS a float
can occur only if the depletion mode pullup transistor
which is normally on is stuck off. In CMOS however, all
the individual stuck off faults for transistors cause floating
outputs. This is necessarily the case in ratioless
CMOS since for each input combination, there should be
either a path from VDD or a path from VSS to the output
of a gate. To detect a stuck off transistor, such a path
must be broken, causing a floating output. Similarly,
any single transistor stuck on will cause a VDD-VSS short.

To detect a short, we may choose in the design of
the controllable gate to size transistors so that such
shorts appear as a valid logic value. This may cause
some transistor faults to be undetectable. The most
desirable such ratioing is to make the p-type $C_1$ and $C_2$
transistors with high resistance and the n-type $C_1$ and $C_2$
transistors with low resistance. This is also desirable for
good performance of the logic gate in normal operation.
The remaining transistors are adjusted so that any short
leaves the output at a logic 0 (as for nMOS). Then, an
n-type transistor stuck on will be detected as yielding an
incorrect logic value; a p-type transistor stuck on will
give the correct logic value but at a higher voltage within
the allowable range.

Sizing transistors to handle shorts negates one of
the great advantages of ratioless CMOS: the ability to
design logic gates so that transitions to both 1 and 0 are
at approximately the same speed. If a VDD-VSS short
can be detected by detecting high leakage current on
the power bus [Me82, Ac83], this sizing can be eliminated.
The controllable gate can be designed for
optimum fault-free performance. This is the method of
choice for detecting shorts.

To detect erroneous but logically valid outputs and
floating outputs, we must observe the values of each gate
output. Possible distributed AND and OR configurations
for CMOS are shown in Figure 2. These are similar to circuits
which may be used in nMOS. In CMOS we use only
n-type transistors and propagate only logic 0 through
them. To use either type of observation circuit, we hold
all circuit inputs at a test value while doing the following.
The observation logic output is set to logic 1 by a
separate source while holding the observation logic input
at 1. Then, the separate source is disconnected and the
observation logic input is set to 0. For the AND
configuration, this 0 will propagate to the output if all
wires being observed are correctly at value 1. For the OR
configuration, the 0 will propagate to the output if some
wire being observed is incorrectly at value 1. Figure 2
shows the basic construction. In practice, drivers may
be needed so that the observation logic does not have
too severe a delay. Note that this delay only affects the
time to test the circuit; the observation logic is not
used during normal operation. Since there are very few
test vectors to be applied, we can afford a longer delay
for each test vector than if we were using a method in
which thousands of test vectors were applied. This delay
is a disadvantage in that it may prohibit "at speed" testing
of the circuit.
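
The precharge-then-propagate procedure can be modeled at the logic level. The sketch below is an abstraction under stated assumptions, not the circuit of Figure 2: it assumes only that the AND configuration passes the 0 when every observed wire is 1 and the OR configuration passes it when any observed wire is 1. All names are hypothetical.

```python
def observe_and(wires):
    """AND configuration: output is precharged to 1; the 0 applied at the
    observation input reaches the output only if every observed wire is 1."""
    output = 1                      # precharged by the separate source
    if all(w == 1 for w in wires):
        output = 0                  # the 0 propagates through
    return output


def observe_or(wires):
    """OR configuration: the 0 reaches the output if some observed wire is 1."""
    output = 1
    if any(w == 1 for w in wires):
        output = 0
    return output


def event_observed(should_be_one, should_be_zero):
    """An event is confirmed when the AND output fell to 0 (all expected 1s
    present) and the OR output stayed at 1 (no unexpected 1)."""
    return observe_and(should_be_one) == 0 and observe_or(should_be_zero) == 1


print(event_observed([1, 1, 1], [0, 0]))   # fault-free event
print(event_observed([1, 0, 1], [0, 0]))   # one output wrongly at 0
print(event_observed([1, 1, 1], [0, 1]))   # one output wrongly at 1
```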

To detect floating outputs, we use the fact that such
an output will hold its old value for several circuit
delays. We sequence the test vectors so that the correct
values of gate outputs alternate between 0 and 1. Thus,
when a gate output is floating, it will hold an old value
different from the expected value. Note that there are in
fact only five test vectors.
If we use the sequence $v_1, v_2, v_3, v_4, v_1, v_5, v_2, v_5$, then
the outputs of black gates should take on the sequence
of values 1,0,1,0,1,1,0,1 and the outputs of white gates
should take on the sequence of values 0,1,0,1,0,1,1,1. In
this case, for each gate, each test vector is applied at
least once when the previous output of the gate is opposite
from the expected output for the test vector.
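
The stated property of the sequence can be checked mechanically. In the sketch below the two output sequences are taken from the text; the vector labels for the eight positions are an assumption inferred from those output sequences, since the printed sequence is partly illegible, and the helper name is hypothetical.

```python
# Expected gate outputs for the eight applied vectors (from the text).
BLACK = [1, 0, 1, 0, 1, 1, 0, 1]
WHITE = [0, 1, 0, 1, 0, 1, 1, 1]

# Assumed vector label at each position (inferred, not confirmed by the source).
VECTORS = [1, 2, 3, 4, 1, 5, 2, 5]


def vectors_applied_after_opposite(vectors, outputs):
    """Labels of vectors applied at least once when the previous expected
    output of the gate was the opposite value."""
    covered = set()
    for i in range(1, len(vectors)):
        if outputs[i] != outputs[i - 1]:
            covered.add(vectors[i])
    return covered


print(sorted(vectors_applied_after_opposite(VECTORS, BLACK)))
print(sorted(vectors_applied_after_opposite(VECTORS, WHITE)))
```

Under the assumed labeling, all five vectors are covered for both colors, so a floating output would hold a stale value that differs from the expected one at least once per vector.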

4.  Sequential  Logic 

We have described our approach as applied to combinational
circuits. We now consider sequential circuits
composed of bipartite combinational logic subcircuits
separated by simple latches. We will consider only
latches with one data input, one output, and one latch
signal. When the latch signal is high, the value of the data
input is stored. The output is equal to the last input
value stored and is always available. Given such a
sequential circuit with bipartite combinational subcircuits,
a scan path technique [Wi81] could be used to gain
access to the latches. Our technique could be used for
the combinational portions by loading and unloading the
latches through the scan paths. The sequential loading
and unloading of the latches is slow. Even worse, for
CMOS where the sequence of test vectors is important,
the sequential loading may cause serious problems in
testing for floating output nodes; the exact design of the
scan path latch will determine what is feasible. However,
there is an alternative. If a special controllable latch is
used in place of the simple latch, sequential circuits with
bipartite combinational subcircuits can be tested using
our approach without the use of scan path circuitry.

The testing of sequential circuits using the controllable
latch breaks up the circuit into controllable combinational
pieces just as scan path techniques do. However,
instead of loading the latches with values during
testing, each controllable latch has special test modes.
In each test mode, the outputs of a latch are the specific
values needed for the current test, regardless of the
stored value. Thus, in addition to the data input and
latch signal, the controllable latch will have mode inputs.
In the most general case, the latch output may be connected
to both black and white gates. In this case, the
latch will have five modes: (i) all outputs equal to last
stored input (normal), (ii) all outputs equal to 1, (iii) all
outputs equal to 0, (iv) all outputs to white gates equal to
0 and outputs to black gates equal to 1, (v) all outputs to
white gates equal to 1 and outputs to black gates equal
to 0. (Note that for a NOR/inverter circuit mode (iii) is
not needed, but in a NOR/NAND/inverter circuit it would
be.) Modes (ii) through (v) are test modes and ignore the
value actually stored in the latch. During testing the
outputs (black and white) of the controllable latch are
observed in the same fashion as the output of a logic
gate. Figure 3 gives one possible design for such a gate
in nMOS.
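
The five modes can be summarized as a small behavioral model. This is only a sketch: the class, enum, and method names are hypothetical, and it models logic behavior only, not the nMOS design of Figure 3.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "i"           # outputs equal the last stored input
    ALL_ONE = "ii"         # all outputs 1
    ALL_ZERO = "iii"       # all outputs 0
    WHITE0_BLACK1 = "iv"   # outputs to white gates 0, to black gates 1
    WHITE1_BLACK0 = "v"    # outputs to white gates 1, to black gates 0


class ControllableLatch:
    def __init__(self):
        self.stored = 0

    def clock(self, data, latch_signal):
        """Store the data input while the latch signal is high."""
        if latch_signal:
            self.stored = data

    def output(self, mode, color):
        """Value driven onto an output wire feeding a gate of the given color.
        Test modes ignore the stored value."""
        if mode is Mode.NORMAL:
            return self.stored
        if mode is Mode.ALL_ONE:
            return 1
        if mode is Mode.ALL_ZERO:
            return 0
        if mode is Mode.WHITE0_BLACK1:
            return 0 if color == "white" else 1
        return 1 if color == "white" else 0   # Mode.WHITE1_BLACK0


latch = ControllableLatch()
latch.clock(data=1, latch_signal=True)
print(latch.output(Mode.NORMAL, "white"), latch.output(Mode.WHITE0_BLACK1, "white"))
```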

The test sequence using controllable latches is as
follows:

1. Test combinational logic. Set the modes using (ii)
through (v) above as appropriate to test the combinational
logic pieces. Observe that correct test
values are on outputs of each latch.
2. Test latching of input. The data input to each latch
is either white or black; we call the latch white or
black accordingly. Since the combinational logic
has been tested, we can use it to set a known value
in each latch.

A. Using mode (iv), set the inputs to all white
latches to 0 and the inputs to all black latches
to 1. Raise all latch signals so these values
should be stored.

B. Lower all latch signals. Use mode (i) to propagate
latched values to observable outputs.
Repeat A and B using mode (v) in A and again using
mode (iv) in A. For each latch, this will test the
transition from storing a 0 to storing a 1 and the
transition from storing a 1 to storing a 0. In B, only
the values at latch outputs are observed since the
distinction between white and black wires is not
maintained at this step. This introduces two new
events to observe.

Note that to execute 2, the latch signals of all
latches must be controllable independent of the values
in the circuit under test. In many circuits this will happen
naturally: the control signals and data are received
and manipulated separately. If this independence does
not hold, there are several alternatives. The most direct
is to modify the latch signal logic so that under test the
latch signals can be controlled directly. Note that only
one latch signal can be used within a latch, otherwise the
latch will not be completely testable. Of course, the
logic creating the modified latch signal must be tested.
Another alternative is to test only some latches at a
time. For example, there may be two sets of latches,
with the latch signals and data inputs for each dependent
on the output of the other but the latch signals and
data inputs for each latch in a given set independently
controllable. Then we can test one set at a time by
appropriately setting the outputs of the other set. Such
a method requires careful analysis of the dependencies
of the circuit and brings us once again to the problems
of test vector generation. Thus, it is not in keeping with
the spirit of our approach. However, it may prove to be
a desirable method, especially if there are very few
groups of latches which interact.

5. Finding Bipartite Circuits

Our approach requires that the combinational logic
sections of a circuit be bipartite. In [La83], we have
shown that any circuit can be transformed into a bipartite
version by at worst doubling the number of gates.
This transformation does not increase the fan-out or the
length of input/output paths. We also require the use of
controllable input pads similar to the controllable latch
described above. In test mode, a controllable input pad
allows black wires and white wires from the pad to take
on opposite values.

Since the cost of transforming an arbitrary combinational
circuit into a bipartite circuit may be very high,
we wish to identify technologies and design methodologies
which produce bipartite circuits naturally. Two common
design structures which are bipartite are the NOR-NOR
and NAND-NAND PLAs. (Note that each gate of a PLA
may have high fan-in, requiring a proportional number of
control signals.) PLAs are in fact examples of level logic:
logic which is constructed in levels of gates so that gates
in the $i$-th level receive inputs from gates in the $(i-1)$-th
level and send outputs to gates in the $(i+1)$-th level. Any
level circuit is bipartite. It is interesting to note that to
string together PLAs, one must take care at the inputs.
Often, a PLA uses an input value and its complement as
shown in Figure 4. When the input is treated as a circuit
input with a controllable pad, this does not present a
problem. However, if the input is to come as the output
of some other PLA, the combined circuit may not be
bipartite. Instead of employing the circuit transformation,
one may use a pseudo input pad between the two
PLAs so that the input value and its complement can be
decoupled (take on the same value) during test. This
illustrates one important theme which emerges from our
test approach:

    Allow variables and their complements to assume
    values independently during test.

The usefulness of this theme has been noted by others as
well (c.f. [Sa82]). The pseudo input pad essentially
breaks a "bad" edge. The concept of breaking "bad"
edges will be expanded upon below. Note that the result
is not always fully testable.

A design methodology which produces bipartite circuits
is domino CMOS [Ho63]. In domino CMOS, each
composite gate is actually a precharged inverting gate
followed by a standard CMOS inverter. These composite
gates can be connected in any feedback-free fashion to
produce a bipartite circuit (color all precharged gates
white and all standard inverters black).

The transformation presented in [La83] is advantageous
because it does not increase the fan-out or lengths
of paths of the circuit. Thus it does not increase the two
major sources of delay. However, the area penalty for a
particular circuit can be very large. A circuit may be
"almost" bipartite in one of two ways. Its transformation
may require only a small number of extra gates, or
there may be a black/white coloring for it in which few
edges go between gates of the same color (offending
edges). Note that the latter does not imply the former,
as shown in Figure 5. We would like a circuit with few
offending edges to have a low cost modification to a
bipartite version, but it may not under the original
transformation. In the remainder of this section we discuss
ways of modifying such a circuit at the offending
edges so that the parity principle can be used. These
techniques will not incur the area cost but have other
costs. It should be noted that both finding a minimum
size transformed circuit and finding a coloring with a
minimum number of offending edges are computationally
difficult (i.e., NP-complete [Go?B, Ga82]). Therefore,
we do not expect to solve either problem optimally but
to find good solutions.
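
As a rough illustration of looking for a coloring with few
offending edges, the sketch below 2-colors a gate graph by
breadth-first search and reports the edges that end up joining
gates of the same color. The function name `two_color` and the
gate labels are assumptions made for this example. It is a simple
heuristic only: it does not find a minimum set of offending edges,
which the text notes is computationally difficult.

```python
from collections import deque

def two_color(gates, edges):
    """Greedy BFS 2-coloring of a gate graph (0 = white, 1 = black).

    Returns the coloring and the list of offending edges, i.e. edges
    whose two endpoints received the same color.
    """
    adjacent = {g: [] for g in gates}
    for u, v in edges:
        adjacent[u].append(v)
        adjacent[v].append(u)

    color = {}
    for start in gates:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacent[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)

    offending = [(u, v) for u, v in edges if color[u] == color[v]]
    return color, offending

# A three-level circuit is bipartite: no offending edges.
level_edges = [("a", "c"), ("b", "c"), ("c", "d")]
_, bad = two_color(["a", "b", "c", "d"], level_edges)
print(bad)

# Adding an edge that skips a level creates an odd cycle,
# so at least one edge must be offending.
_, bad = two_color(["a", "b", "c", "d"], level_edges + [("a", "d")])
print(len(bad))
```

Each offending edge reported this way is a candidate for one of
the edge-modification techniques discussed in the rest of this
section.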

In each of the techniques, we will conceptually
insert an extra gate on each offending edge during test
so that the edge is split into two and the circuit becomes
bipartite. The most straightforward solution is to actually
do this as shown in Figure 6a. This introduces an
extra delay during normal operation of one pass transistor
(transmission gate) on each offending edge. Even
worse, the normal functioning of the "edge" cannot be
tested using our approach. Thus this solution costs both
speed and fault coverage but has a small area penalty
when there are few offending edges.

By using an inverter whose output can be isolated as
shown in Figure 6b, part of the testing problem of our
initial "splitting" method can be alleviated. This
inverter must have a pullup which will not dominate the
pulldown paths within the preceding logic gate. In this
respect, it is designed to be fault tolerant. A third control
signal is required.

Two alternatives present themselves to solve the
remaining lack of fault coverage. One is to introduce a
second path for normal operation (making a double
edge) and through this redundancy make the construction
fault tolerant. The layout of such a construction
with redundant paths is very important, since the likelihood
of both paths being broken must be low. This alternative
increases the area penalty at each offending edge
and does not improve the fault coverage.

The second alternative is to modify the preceding
logic gate so that we have further control over it. Note
that the controllable NOR can be set to output a 1
regardless of the values of its normal inputs (the controllable
NAND can be set to output a 0). Suppose we
further modify the NOR, using another control input, to
output a 0 regardless of the value of its normal inputs. If
an offending edge is from such a controllable NOR, we
can check that the normal connection from the gate to
the next gate is working, i.e., that the "edge" works
correctly in normal operation. To do this, we generate a
0 and a 1 as outputs of the gate independent of the
gate's normal inputs. For each value, we select the normal
connection through the edge and observe the value
of the edge beyond the test mode circuitry. This test will
introduce two new events at such edges. This alternative
requires a more complicated gate preceding an
offending edge, thus further increasing the area penalty.
It also requires more test vectors and observation logic
than the simple strategy for "splitting" an edge. Again, if
few edges are involved, the area penalties may still be
less than using the original transformation.

The techniques described above all "split" an
offending edge during testing so that the parity principle
will hold. We may instead reroute inputs so that each
gate receives inputs from gates of the correct color during
test. To do this, for each offending edge, we introduce
an alternate edge to the same input as the
offending edge but from an arbitrary gate of the correct
color. The offending edge is selected for normal operation
and the alternate edge is selected for test. This
involves the addition of wires and pass transistors as
shown in Figure 7. As for our simple splitting technique,
a pass transistor delay is added to each offending edge
and faults in the pass transistors for normal operation
are undetected. This technique does not require the
addition of an inverter. If routing the alternate connections
is easy, very little additional area is required. Note
that this technique tests a different circuit than the one
desired, but one with the same gates. Thus all the gates
are tested but not all of the connections are tested.

6  Concluding  Remarks 

We have presented a method of creating easily
testable circuits, focusing on the method as applied to
CMOS and sequential circuits. Our method requires
bipartite combinational logic, specially designed logic
gates and latches, and the addition of observation logic.
The special components and observation logic could be
used separately, but are designed to work with the bipartite
properties to produce a circuit requiring very few
test vectors. We have also presented some alternatives
to the transformation presented in [La83] which try to
minimize the area penalty of making a circuit bipartite
for testing purposes.

We have concentrated on the stuck-at fault model.
However, in addition to all stuck-at faults, our method
will catch any logic gate fault which causes the gate to
become non-inverting. Also, any bridging fault between a
white wire and a black wire, causing both to take on the
same values, will be detected. However, such bridging
faults between wires of the same color are not detected.
It is interesting to note that faults causing CMOS logic
gates to have floating outputs are difficult to deal with in
conventional testing methods since a large set of test
vectors must be sequenced [Ch83]. In our method, the
small number of test vectors allows us to easily sequence
them by inspection to produce tests for these faults.
Another advantage of the method is the ability to design
self-testing chips using it, allowing the possibility of
in-field testing.

Given a circuit which has bipartite combinational
logic, there remain some costs in area and speed associated
with the method we have described. The controllable
gates and latches, routing of control signals, and
observation logic require extra area. The controllable
logic gates may be somewhat slower than their standard
counterparts, but this penalty can be minimized during
gate design (normally at the expense of area). The
observation logic will present a small extra load on each
gate whose output is observed; this will also introduce
delay. Consequently, we do expect a circuit designed for
this testing method to run somewhat slower than if
designed without the testability structures. We believe
this added delay will be small. However, future work
comparing versions of circuits designed with and without
the testability structures is necessary.

The practicality of the method we have presented
ultimately depends on the relative values of chip fabrication
cost, chip performance, and cost of faulty chips
reaching the field. The additional circuitry will decrease
the yield of chips designed using our method and thus
increase the cost of fabrication. The higher fault coverage
must more than compensate for the increase in
faulty chips to decrease the number of faulty chips ending
up in the field. We believe our method will prove
advantageous when high confidence in chip correctness
is required. However, further study is necessary to
determine the actual penalties and gains of the method
in various design domains.

1.  SIGNAL  PROCESSING  AND  VLSI

Many  signal  processing  algorithms  are  highly  regular, 
data-independent,  and  access  the  data  in  fixed  patterns.  For 
these  reasons  the  current  technological  advances  in  very  large 
scale  integrated  circuits  hold  especially  great  promise  for  sig¬ 
nal  processing,  and  in  fact  we  now  see  the  development  of 
many  highly  integrated  processors  of  a  more  or  less  special¬ 
ized  nature.  At  one  end  of  the  spectrum,  we  see  programm¬ 
able  signal  processing  chips  that  are  really  microprocessors, 
with  program,  memory  and  logic  separated  as  in  a  general- 
purpose  machine.  At  the  other  extreme,  we  see  highly- 
specialized,  custom  chips  that  perform  fixed  tasks;  typically 
the  data  moves  through  the  chip  along  fixed,  regular  paths, 
the  arithmetic  logic  is  distributed  in  space,  and  the  "program" 
is  really  "hard-wired"  into  the  topology.  This  talk  is  devoted  to 
a  study  of  this  latter,  custom  variety  of  architecture. 

The range of algorithms that are commonly used for digital
signal processing is not very great; a few very important
algorithms are used intensively. They fall roughly into four
categories: convolution and filtering; Fourier transforms;
matrix calculations (see, for example, [36]); and iterative
algorithms for adaptive filtering. All these applications are
characterized by two important properties that make
special-purpose, highly dense hardware very attractive:

•  high-volume,  real-  or  nearly  real-time  data-flow  require¬ 
ments,  and 

•  effective  algorithms  with  regular  patterns  of  data  access 
and  fixed  operation  sequences. 

There  are  direct  architectural  consequences  of  these 
characteristics.  The  regularity  of  the  patterns  of  data  access 
and  operation  sequences  makes  possible  a  high  degree  of  pipe¬ 
lining  and  multiplexing.  This,  in  turn,  makes  possible  a  high- 
volume  data  flow.  Furthermore,  the  regularity  of  the  algo¬ 
rithms  is  reflected  in  a  regularity  of  VLSI  circuit  structure,  so 
that  a  hierarchical  approach  to  layout  design  and  specification 
becomes  possible,  and  that  greatly  simplifies  the  design  and 
development  of  large-scale,  custom  VLSI  circuits  for  digital 
signal processing. The rest of this talk is devoted to these
architectural consequences. In the next section we will discuss
some general aspects of parallelism, and why the need for
parallelism justifies the development of custom,
single-purpose chips. Section 3 is devoted to highly-pipelined and
systolic structures, using filtering as an example, with a review
of some useful topologies. Section 4 deals with how some of
the important structures can be combined in a hierarchical
way. We will review mathematical models of VLSI computation
and available lower and upper performance bounds in Section
5. Section 6 will deal with the practical matter of maximizing
throughput by appropriate choice of latching density.

2.  PARALLELISM  AND  THE  CUSTOM/PROGRAMMABLE  DECISION

There  is  a  general  tradeoff  between  specialization  and  pro¬ 
grammability  in  digital  signal  processing  chips.  The  obvious 
advantages  of  flexibility  afforded  by  programmability  must  be 
weighed  against  the  higher  potential  throughput  of  a  custom 
chip.  The  choice  between  the  two  is  dictated  by  the  cost  in 
time  and  money  of  designing  and  testing  a  chip,  and  by  the 
need for very high-throughput real-time processing. This tradeoff
changes with time and technology: as chip design
becomes  a  highly  automated  process,  and  as  more  real-time, 
high-volume  applications  arise  (such  as  in  the  fields  of  com¬ 
munications  and  robotics),  we  are  likely  to  see  the  prolifera¬ 
tion  of  very  specialized  signal  processing  chips. 

Figure 1. Signal Processing Tasks in the intensity-rate plane
(after [7,9]).

Figure 1 shows one way of looking at the range of signal
processing applications. We plot the data rate in bits/sec as
ordinate, and the computational intensity requirement of a
given task in operations/bit as abscissa. For example, a
low-order FIR filter has a low computational intensity, whereas a
high-order filter has a correspondingly high intensity. A
50th-order FIR filter at a sampling rate of 20,000 words/sec (a
typical audio rate) and 10 bits/word, with say 100 logical
operations (at the gate level) per fixed-point multiply-by-constant,
corresponds to a data rate of 2 x 10^5 bits/sec, and a
computational intensity of 5 x 10^2 operations/bit. For a chip of fixed size
and for a given technology, the product of the two coordinates
in operations/sec is bounded from above by some constant:
the total number of operations possible if every piece of the
chip were doing useful work all the time. For a chip with 10^5
gates and a clock period of 100 nsec, this is 10^12
operations/sec. This boundary is shown by a hyperbola in Fig.
1 (a straight line in log-log coordinates). At the same time,
there is an upper limit on the data rate, determined by the
number of I/O pins and the speed of the I/O drivers.
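
The arithmetic behind this example can be checked with a few
lines of code. The sketch below is only a restatement of the
numbers in the text; the function and parameter names are
assumptions chosen for illustration, and, as in the text, a
50th-order filter is counted as 50 multiplies per output word.

```python
def fir_workload(order, words_per_sec, bits_per_word, ops_per_multiply):
    """Return (data rate in bits/sec, intensity in operations/bit)."""
    data_rate = words_per_sec * bits_per_word
    ops_per_word = order * ops_per_multiply
    intensity = ops_per_word / bits_per_word
    return data_rate, intensity

def chip_capacity(gates, clock_period_ns):
    """Upper bound in operations/sec if every gate works every cycle."""
    return gates * 10**9 // clock_period_ns

rate, intensity = fir_workload(order=50, words_per_sec=20_000,
                               bits_per_word=10, ops_per_multiply=100)
capacity = chip_capacity(gates=10**5, clock_period_ns=100)

print(rate)                  # data rate, bits/sec
print(intensity)             # operations/bit
print(rate * intensity)      # operations/sec needed by the filter
print(capacity)              # operations/sec available on the chip
```

The product of the two coordinates for the audio filter is
10^8 operations/sec, well inside the 10^12 operations/sec
boundary for the example chip.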

We are thus constrained to work within the region shown.
Whenever an operation is carried out that does not contribute
directly  to  the  processing  of  the  signal,  as  counted  by  the 
measure  of  computational  intensity,  we  move  away  from  the 
boundary.  Consider,  for  example,  the  operation  of  a  pro¬ 
grammable  signal  processor  with  a  stored  program.  Every 
instruction  fetch  or  store,  every  instruction  decoding,  and 
every  test  for  branching,  is  wasted  in  the  sense  that  a  part  of 
the  chip  is  doing  work  that  is  not  essential.  Also  wasted,  of 
course,  is  any  part  of  the  chip  that  remains  idle  during  any 
particular  clock  cycle. 

We are led to the conclusion that the most efficient use of
chip area, the dearest resource at present, should avoid
programmability, and should make concurrent use of as much of
the chip as possible. When demands on performance are very
high, at the limits of applications technology, we are led to
the design of custom, single-purpose chips with fixed data-flow
paths. Thus some filtering tasks at audio bandwidths may be
best implemented now with programmable chips, but
applications at video rates, like robot vision, demand custom designs.

In this talk we will use the operation of convolution for our
examples. It is no doubt the most widely-used of all the digital
signal operations and is also representative in terms of
complexity and throughput requirements. We write it as

    \[ y_n = (w * x)_n = \sum_k w_k x_{n-k} = \sum_k x_k w_{n-k} \qquad (1) \]

The function w_k will be called the weight sequence, and will
usually be of finite duration, so that the limits of the
summations in (1) will be finite. We will distinguish two situations in
which convolution is usually implemented: general
convolution, where the weights w_k are variable on a short-term basis,
as fast as the signal; and filtering, where the weights w_k are
fixed (or at least infrequently changed). We will make no
distinction between convolution and correlation, which is simply
convolution with one of the signals time-reversed.
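
A minimal sketch of equation (1) for finite sequences follows.
The function names are assumptions for this example; sequences
are plain lists indexed from zero, and samples outside a list
are taken to be zero.

```python
def convolve(w, x):
    """Full convolution: y[n] = sum over k of w[k] * x[n - k]."""
    y = [0] * (len(w) + len(x) - 1)
    for k, wk in enumerate(w):
        for m, xm in enumerate(x):
            y[k + m] += wk * xm
    return y

def correlate(w, x):
    """Correlation as convolution with one sequence time-reversed."""
    return convolve(w[::-1], x)

w = [1, 2, 3]   # weight sequence
x = [4, 5]      # signal

print(convolve(w, x))   # [4, 13, 22, 15]
print(convolve(x, w))   # same result: the two sums in (1) agree
print(correlate(w, x))
```

Swapping the roles of w and x gives the same output, which is the
equality of the two summations in (1).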

Convolution can be applied in many ways. At the bit level,
with Boolean product, and Boolean sum with carry, it means
binary multiplication. At the signal level it means filtering or
correlation. At the logical level it means pattern matching.
This observation allows us to develop highly regular VLSI
topologies by first developing a structure at the word level for
convolution. A similar structure is then used recursively to build
a multiplier at the bit level. The result is a hierarchical
structure that is highly regular, being uniform in topology all the
way down to the bit level. In the next two sections we carry
out just this plan.
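
The bit-level reading of convolution can be made concrete. In
the sketch below (names are assumptions for this example), the
bits of two non-negative integers are convolved using AND as the
product, and the column sums are then resolved by carry
propagation; the result is the ordinary binary product.

```python
def to_bits(n):
    """Bits of a non-negative integer, least significant first."""
    bits = []
    while n:
        bits.append(n & 1)
        n >>= 1
    return bits or [0]

def multiply_by_convolution(a, b):
    """Multiply by convolving bit sequences, then propagating carries."""
    a_bits, b_bits = to_bits(a), to_bits(b)

    # Convolution with Boolean product: column i+j collects a_i AND b_j.
    columns = [0] * (len(a_bits) + len(b_bits) - 1)
    for i, ai in enumerate(a_bits):
        for j, bj in enumerate(b_bits):
            columns[i + j] += ai & bj

    # Sum with carry: each column keeps one bit and passes the rest on.
    result, carry = 0, 0
    for position, column in enumerate(columns):
        total = column + carry
        result |= (total & 1) << position
        carry = total >> 1
    result |= carry << len(columns)
    return result

print(multiply_by_convolution(13, 11))   # 143
```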

3.  SYSTOLIC  AND  COMPLETELY-PIPELINED  STRUCTURES 

Highly concurrent VLSI circuits can be characterized by
the  following  desirable  properties  [7,9]: 

•  Local-Connectedness: This means that computational
elements are connected only to nearby neighbors.

•  Flow-Simplicity: This means that each element is used only
once per elementary computation.

•  Cell-Simplicity: This means that each element takes only
constant time for its computation; that is, its computation
time does not depend on such parameters as the
number of bits in a word, or the number of coefficients in a
filter.

Systolic  arrays  [21,22,27]  can  be  characterized  as  those 
that  are  both  locally-connected  and  flow-simple.  Wires  are 
short  and  the  data  flows  through  the  structure  in  a  smooth 
way. However, each "cell" may be very complex (a multiplier,
possibly),  and  may  take  time  dependent  on  the  problem 
parameters. 

Completely-pipelined circuits [7,9] are another class of
highly  concurrent,  pipelined  circuits,  characterized  by  the 
properties  of  being  flow-simple  and  cell-simple .  These  circuits 
are  more  general  than  systolic  ones  in  that  long  wires  are 
allowed,  but  more  restricted  in  the  sense  that  the  computa¬ 
tional  cells  operate  in  time  independent  of  the  problem 
parameters  (such  as  word-size). 

In  what  follows  we  will  concentrate  on  the  simplicity  and 
regularity  of  some  computational  structures  and  ignore  some 
problems  that  are  important  at  a  practical  level,  but  which 
would  obscure  the  presentation.  For  example,  the  question  of 
efficient  use  of  area  will  be  ignored  for  now,  but  will  be  dis¬ 
cussed  in  Section  5.  The  problem  of  distributing  power, 
ground,  and  clock  lines  will  likewise  be  ignored;  some 
discussion of these points in the present context can be found
in [9]. We will also not worry about the signs of numbers in
describing multiplication, but assume that extension bits are
added to two's-complement numbers so that the answer is
always in range (see [13,31,33] for some discussion of this
issue).

3.1  Word-Serial  Filtering

Figure 2a shows the conventional signal flow graph F for
FIR filtering (see [37], for example): the input signal x_k is
delayed along a chain of registers, and during each clock
period the appropriate samples are multiplied by the
corresponding weights w_{n-k} and summed. At the right we see
the computation that must be performed every clock cycle.
Notice that for this example of a 4-coefficient filter, not only
do we need to perform 4 multiplications and 3 additions
between clock pulses, but the input of the second addition
depends on the result of the first, and the input of the third
depends on the result of the second. This means that we
cannot perform these additions in parallel, and therefore that the
throughput rate is limited by the time for three additions.
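
The behavior of graph F can be sketched in software. The model
below is an assumption-laden illustration, not a description of
Figure 2a itself: a list stands in for the register chain, and a
counter records how many additions sit in series within one
clock period.

```python
def direct_form_fir(w, x):
    """Simulate the direct-form FIR graph one clock period at a time.

    Returns the output samples and the number of additions that must
    be performed one after another in each clock period.
    """
    registers = [0] * len(w)          # registers[0] holds the newest sample
    outputs = []
    for sample in x:
        registers = [sample] + registers[:-1]      # shift the delay chain
        products = [wk * r for wk, r in zip(w, registers)]

        # Each addition needs the result of the previous one.
        accumulator = products[0]
        for p in products[1:]:
            accumulator += p
        outputs.append(accumulator)
    serial_additions = len(w) - 1
    return outputs, serial_additions

y, depth = direct_form_fir([1, 2, 3, 4], [1, 2, 3, 4, 5])
print(y)       # [1, 4, 10, 20, 30]
print(depth)   # 3 additions in series for a 4-coefficient filter
```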

Figure 2b shows another signal flow graph F^T, the transpose
(see [34], for example) of signal flow graph F (called B1 in
[21]). The transformation of transposition entails reversing
the direction of every arc, replacing summing nodes by
branching nodes, replacing branching nodes by summing
nodes, and interchanging input and output. Here the
sequence of sums is replaced by a broadcast of the input signal
x_k at any time; the computation during each cycle is again
shown at the right. This broadcasting, or fanout, of a signal
carries with it a certain penalty in terms of delay, but is
generally much faster than sequential add operations, so that this
signal flow graph can be implemented with a much higher
throughput if the three additions are implemented with three
adders operating in parallel. The commercial chip described in
[47] uses this transposed structure.
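
A software sketch of the transposed structure is given below. It
is an illustrative model under assumed names: the registers hold
partial sums rather than delayed inputs, the current input sample
is broadcast to every multiplier, and each adder combines one
product with one register value, so no addition waits on another
within a clock period.

```python
def transposed_form_fir(w, x):
    """Simulate the transposed FIR graph one clock period at a time."""
    state = [0] * (len(w) - 1)        # partial sums held between clocks
    outputs = []
    for sample in x:
        products = [wk * sample for wk in w]    # broadcast of the input

        outputs.append(products[0] + (state[0] if state else 0))

        # Independent additions: product i+1 plus the next register.
        next_state = []
        for i in range(len(state)):
            carried = state[i + 1] if i + 1 < len(state) else 0
            next_state.append(products[i + 1] + carried)
        state = next_state
    return outputs

def reference_fir(w, x):
    """Direct evaluation of y[n] = sum over k of w[k] * x[n - k]."""
    return [sum(w[k] * x[n - k] for k in range(len(w)) if 0 <= n - k)
            for n in range(len(x))]

w, x = [1, 2, 3, 4], [1, 2, 3, 4, 5]
print(transposed_form_fir(w, x))   # [1, 4, 10, 20, 30]
print(reference_fir(w, x))         # same values
```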

The fanout problem of graph F^T and the sequential delay of
graph F can be avoided by using the graph F^2, shown in Fig. 2c.
Here on every clock pulse the input signal moves to the right,
and the output signal moves to the left. The transfer function
of F^2 is H(z^2) if the original graph F has transfer function H(z),
so that meaningful output is obtained only every other clock
cycle (this is the structure W1 of [21]). Thus the input and
output signals must be interleaved with zeros (or two
independent filtering operations can be interleaved).
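
The effect of replacing H(z) by H(z^2) can be checked
numerically. The sketch below is not a cycle-by-cycle model of
Fig. 2c; it only assumes that H(z^2) corresponds to the weight
sequence with a zero placed between successive weights, and shows
that feeding such a filter a zero-interleaved input yields the
desired output interleaved with zeros. All names are assumptions
for this example.

```python
def convolve(w, x):
    """Full convolution of two finite sequences."""
    y = [0] * (len(w) + len(x) - 1)
    for k, wk in enumerate(w):
        for m, xm in enumerate(x):
            y[k + m] += wk * xm
    return y

def interleave_zeros(seq):
    """Place a zero between successive samples: [a, b] -> [a, 0, b]."""
    out = []
    for value in seq:
        out.extend([value, 0])
    return out[:-1]

w = [1, 2, 3]
x = [4, 5]

slow = convolve(interleave_zeros(w), interleave_zeros(x))
print(slow)                               # [4, 0, 13, 0, 22, 0, 15]
print(interleave_zeros(convolve(w, x)))   # the same sequence
```

Every other output sample is zero, which is why only every other
clock cycle carries meaningful output, and why a second,
independent filtering operation can occupy the unused cycles.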

It seems that there is no way to avoid all the difficulties
with a single structure. For example, the summing nodes of F
can be separated by registers (delays), and corresponding
extra delays inserted between inputs, producing the circuit F*
shown in Fig. 2d (called W2 in [21]). Here the input and output
signals move in the same direction, but at different speeds.
But this graph has more registers than F or F*, and has a delay
before the output appears.

Finally, Fig. 2e shows the structure F^t that results when
the additions in F are performed using a binary-add tree
[7,9,37]. This is a convenient structure for visualizing the
convolution operation, and may be useful for general convolution
(as opposed to filtering). Care must be taken, however, that
the tree is laid out in a way that does not take up too much
area on the chip. The recursive configuration of an H-tree is
useful for that purpose [32].
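
The advantage of the binary-add tree is that the number of
additions in series grows with the logarithm of the number of
products rather than linearly. A small sketch, with assumed names
and a non-empty input list assumed:

```python
def tree_sum(values):
    """Sum a non-empty list with a binary-add tree.

    Returns the sum and the number of tree levels, i.e. the number
    of additions that lie in series on the longest path.
    """
    level = list(values)
    depth = 0
    while len(level) > 1:
        next_level = [level[i] + level[i + 1]
                      for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                 # odd element passes through
            next_level.append(level[-1])
        level = next_level
        depth += 1
    return level[0], depth

print(tree_sum([1, 2, 3, 4]))          # (10, 2)
print(tree_sum(list(range(1, 9))))     # (36, 3)
```

For the 4-coefficient example, the tree needs two additions in
series where the chain in graph F needs three.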

Are  the  preceding  structures,  by  our  terminology,  systolic 
or  completely-pipelined?  Graph  F  can  be  laid  out  to  be  locally- 
connected,  but  is  not  flow-simple  if  a  single  adder  is  used 
sequentially  (and  that  is  only  reasonable  since  multiple  adders 
would  not  be  usable  in  parallel).  Neither  is  it  cell-simple,  since 
the  time  for  the  elementary  computation  (that  between  clock 
pulses)  depends  on  the  number  of  coefficients  in  the  filter.  The 
structure  F  is  therefore  neither  systolic  nor  completely- 
pipelined. 

Signal flow graph F^T is not locally-connected, because the
length  of  the  longest  wire,  used  to  broadcast  x,  depends  on  the 
filter  order.  It  is,  however,  flow-  and  cell-simple,  so  it  is  by  our 
definition  completely-pipelined,  but  not  systolic. 

Graph F* is both systolic and completely-pipelined; it can
be laid out so as to be locally-connected, and is flow- and
cell-simple.

Finally, signal flow graph F^t is completely-pipelined, but
not  systolic,  since  any  layout  (including  H-trees)  will  have 
wires  whose  lengths  depend  on  the  filter  order. 
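
The two definitions used in this classification reduce to simple
Boolean combinations of the three properties. The sketch below
restates them; the function name and the labels in the table are
assumptions for this example, and the property values are those
stated in the text for the broadcast (transposed) graph, the
binary-add tree, and graph F with a single sequential adder.

```python
def classify(locally_connected, flow_simple, cell_simple):
    """Apply the definitions of systolic and completely-pipelined."""
    return {
        "systolic": locally_connected and flow_simple,
        "completely_pipelined": flow_simple and cell_simple,
    }

# (locally-connected, flow-simple, cell-simple), as stated in the text.
structures = {
    "F, single sequential adder": (True, False, False),
    "transposed graph, broadcast input": (False, True, True),
    "binary-add tree": (False, True, True),
}

for name, properties in structures.items():
    print(name, classify(*properties))
```

A structure with all three properties, such as the third graph
discussed above, is both systolic and completely-pipelined.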

3.2  Bit-Serial  Multiplication 

The same structures used for word-serial filtering can be
used for bit-serial multiplication by a constant, with the
difference that each summing node is a full adder with three
inputs and two outputs. The inputs to each full adder are the
two addend bits and the preceding carry bit; the outputs are
the sum bit and the carry bit. Figure 3 shows the multiplier
corresponding to graph F*; it is really no more than a simple
implementation of the ordinary shift-and-add
elementary-school multiplication algorithm. We will denote by M, M^T,
M^2, and M^t the multipliers corresponding to F, F^T, F^2, and F^t,
respectively.

Figure 3. M*: The structure F* adapted for bit-serial
multiplication.
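
The shift-and-add algorithm built from full adders can be
sketched as follows. This is a functional illustration under
assumed names, not a timing-accurate model of Figure 3: each
full adder takes two addend bits and a carry in, and produces a
sum bit and a carry out, exactly as described above. Operands are
assumed to be non-negative and to fit in `nbits` bits.

```python
def full_adder(a, b, carry_in):
    """Three inputs, two outputs: (sum bit, carry bit)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    return sum_bit, carry_out

def add_bits(u, v):
    """Add two bit lists (least significant bit first) with full adders."""
    n = max(len(u), len(v))
    u = u + [0] * (n - len(u))
    v = v + [0] * (n - len(v))
    result, carry = [], 0
    for a, b in zip(u, v):
        s, carry = full_adder(a, b, carry)
        result.append(s)
    result.append(carry)
    return result

def shift_and_add(x, w, nbits):
    """Multiply x by the constant w, both assumed to fit in nbits bits."""
    x_bits = [(x >> j) & 1 for j in range(nbits)]
    accumulator = [0]
    for i in range(nbits):
        if (w >> i) & 1:
            # Partial product: x shifted left by i bit positions.
            accumulator = add_bits(accumulator, [0] * i + x_bits)
    return sum(bit << position for position, bit in enumerate(accumulator))

print(shift_and_add(13, 11, nbits=4))   # 143
```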

3.3  Word-Parallel  Filtering

We  now  consider  word-parallel  filtering,  and,  in  the  next 
section,  the  corresponding  operation  of  bit-parallel  multiplica¬ 
tion.  Figure  4  shows  a  diamond  array  with  the  signal  x  enter¬ 
ing  from  the  top  left  and  the  filter  weights  from  the  top  right 
(when the weights are fixed, they need not be transmitted
through  the  array  as  shown  but  can  be  stored  in  place).  Notice 
that  the  signal  values  xk  corresponding  to  a  given  signal  are 
arranged  on  a  horizontal  line,  and  hence  skewed  in  time  so 
that  successive  values  enter  the  diamond  array  at  successive 
clock  pulses.  The  next  horizontal  line  will  have  another  signal 
in  it,  and  blocks  of  filtered  signals  emerge  from  the  bottom  of 
the array at successive clock pulses. The sides of the diamond
array  have  length  proportional  to  the  filter  order. 

Figure  5  shows  the  detail  of  a  node  of  the  array:  Each  node 
in  Fig.  4  contains  a  multiplication  by  a  weight  and  an  adder, 
and each arc has a delay element (a latched register). We will
call  the  overall  structure  FA  (for  Array  Filter). 

3.4  Bit-Parallel  Multiplication

As  before,  the  filtering  structure  becomes  a  multiplication 
structure  when  the  multipliers  are  replaced  by  Boolean  pro¬ 
duct  and  the  summers  by  full  adders  with  carries.  In  this  case 
the  carries  propagate  down  and  to  the  left,  which  direction 
corresponds  to  the  next-higher  bit  of  the  product.  An  extra 
triangle  is  needed  at  the  lower  left  so  that  the  carry  bits  can 
propagate  all  the  way  to  the  left  (see  [31],  for  example).  The 
reader  will  recognize  this  signal  flow  graph  as  nothing  more 
than  an  array  multiplier  (we  will  call  it  MA),  with  every  value 
latched  between  clock  pulses  (see  [31,37]).  This  parallel  multi¬ 
plication  structure  has  the  property  that  the  full  adders  at  the 
top  of  the  diamond  can  accept  inputs  without  extra  logic,  so 
that  the  multiplier  can  function  as  an  accumulator  as  well,  and 
this fact is useful in FIR filters and other applications [10,15].

This  hexagonally-connected  array  multiplier  is  locally- 
connected,  cell-simple,  and  flow-simple,  and  is  therefore  both 
systolic  and  completely  pipelined  by  our  definition. 

3.5  Other  Useful  Structures

We have already seen the linearly-connected array (in all
the serial examples), the hexagonally-connected array (in FA
and MA), and the leaf-connected tree [7,9] (in F^t and M^t). We
now mention some of the other regular topologies that have
proven useful in constructing computational networks for
VLSI. A structure called cube-connected cycles is used in [35]
for bit-parallel multiplication. A tree-like structure can be
used to shorten the delay (latency) of a parallel multiplier, and
the resulting structure, called a mesh-of-trees, can be found in
[8,2b]. Leighton also discusses an analogous topology called a
tree-of-meshes [25].

We  have  seen  above  a  variety  of  different  topologies,  all  of 
which perform similar computational tasks. Some work has
been done towards developing a unified treatment of
tional  structures  of  this  type,  and  showing  how  they  can  be 
expressed  conveniently  and  derived  from  each  other.  For  more 
about  such  mathematical  representations  see  [12,14,20,45,46]. 

4.  HIERARCHICAL  METHODOLOGY 

An  important  feature  of  the  regular  topologies  exemplified 
in  the  preceding  section  is  that  they  can  be  combined  in  a 
recursive,  or  hierarchical,  way.  The  most  obvious  application 
of this idea is to use bit-serial multiplication within a
word-serial filter, yielding a bit-serial, word-serial filter that is fully
pipelined. On each clock pulse, every bit moves, every piece of
hardware (silicon) is used, and one output bit appears. Bit-
serial  adders  are  needed  at  the  summing  nodes.  Such  struc¬ 
tures  have  been  discussed  widely  in  the  literature  recently 
[6,13,16,23,30],  and  are  attractive  at  this  time  because  a  rea¬ 
sonably  high-order  filter  can  fit  on  one  chip,  and  the  intercon¬ 
nection  problems  caused  by  high  pin  counts  are  greatly  allevi¬ 
ated  by  the  bit-serial  nature  of  the  computation. 

Suppose for illustration that the multiplier M1 is used
within the filter F1, resulting in what we will call F1(M1). Figure
6 shows a schematic representation of this filter, which is
similar to those described in [6,13]. In theory, then, we have the
ingredients for 5 x 5 = 25 different bit-serial, word-serial
filters, all of which have slightly different timing and layout
details.


Figure 6. Recursive use of the structure F1: F1(M1).

To go one step further, we can combine serial and parallel
structures. For example, at the other extreme from the completely
bit-serial filter just described, we can assemble FA(MA),
producing  a  bit-parallel,  word-parallel  filter  —  one  that  pro¬ 
duces  a  completely  filtered  block  of  signal  samples  once  every 
clock  pulse.  (Now  we  need  bit-parallel  adders  at  the  summing 
nodes.)  Of  course,  the  amount  of  area  is  greatly  increased  over 
the  bit-serial  filter,  but  so  is  the  throughput. 

In the same way, we could use a bit-serial multiplier within
a word-parallel filter (yielding FA(M1), for example), which
produces a bit-serial, word-parallel filter that, for B-bit words,
produces a complete block of filtered samples every B clock
pulses.  With  only  the  6  different  structures  discussed  here, 
there  are  36  possibilities,  each  having  its  own  characteristics 
in  terms  of  layout  area  and  throughput. 

An  important  advantage  of  this  approach  to  VLSI  layout  is 
that some of the problems associated with design and layout
are greatly simplified, since the overall problem is broken
down into natural pieces, each of which can be handled in relative
isolation. Such a design methodology is well-suited to the
use  of  high-level  layout  languages  and  silicon  compilers  (see, 
for  example,  [  17. )  6.20,38] ) 

Another  advantage  of  the  hierarchical  approach  is  in  the 
crucial  but  often  neglected  area  of  testing.  Because  the  com¬ 
plexity  of  testing  arbitrary  circuits  can  grow  exponentially 
with the size of the circuit, it is a great advantage to be able to
break  a  circuit  into  blocks  whose  function  can  be  tested 
independently  of  the  other  blocks.  Much  the  same  approach 
has  proven  very  valuable  in  the  design  of  large  software  sys¬ 
tems.  Some  recent  results  in  the  testing  of  regular  bilateral 
arrays  can  be  found  in  [ IS, 40, 42]. 

5. MODELS AND BOUNDS FOR VLSI

The  signal  processor  has  heretofore  been  concerned 
mainly  with  speed  of  operation.  High  throughput  on  a  general- 
purpose  computer  is  achieved  by  managing  time.  But  now  the 
designer  of  systems  has  a  new  resource  to  manage:  area.  It  is 
no  longer  sufficient  to  specify  a  sequence  of  instructions  for 
data  processing.  We  must  now  specify  a  geometric  layout.  Of 
course  the  requirements  of  high  speed  and  small  area  are 
mutually conflicting. Consider, for example, the multipliers
discussed above. A B-bit-serial multiplier like M1 will generally
have a throughput rate of one product every B clock pulses
before the answer is ready, and generally takes area proportional
to B. On the other hand, a B-bit-parallel multiplier such
as MA has a throughput rate of one product every clock pulse
(once the pipeline is full), but area proportional to $B^2$. Thus
there  appears  to  be  a  conservation  law  at  work,  and  we  expect 
that  bounds  can  be  obtained  on  such  quantities  as  throughput 
per  unit  area.  We  will  describe  some  such  bounds  below. 

5.1  Some  Terminology 

We  need  to  define  some  important  terms  precisely.  First, 
the delay or latency T of a signal processing device is the time
between the arrival of the first bit of the input signal at the
input port, and the time that the last bit of the answer
appears at the output port. This is the usual usage of the term
"computation time." But in many signal processing applications
we are concerned more with the throughput rate than
with the delay. We define the time between successive outputs
with pipelines full as the period P of a chip, and the reciprocal
of the period as the throughput.

If a quantity is bounded from above by a constant multiple
of $f(B)$ for sufficiently large B, where B is any parameter of
interest (often the number of bits), we say the quantity is
$O(f(B))$. So, for example, the array multiplier MA requires area
$O(B^2)$. A corresponding lower bound is written $\Omega(f(B))$.
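
Stated a little more formally, the two notations can be written as below. This is the standard formulation; the constant $c$, the threshold $B_0$, and the name $g$ for the quantity being bounded are introduced here for illustration and do not appear in the text.

```latex
\begin{align*}
g(B) = O(f(B))
  &\iff \exists\, c > 0,\ B_0 :\quad g(B) \le c\, f(B) \ \text{ for all } B \ge B_0 \\
g(B) = \Omega(f(B))
  &\iff \exists\, c > 0,\ B_0 :\quad g(B) \ge c\, f(B) \ \text{ for all } B \ge B_0
\end{align*}
```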

5.2  A  VLSI  Model 

We  will  next  describe  a  mathematical  model  for  a  VLSI 
chip,  one  that  is  abstract  and  simple  enough  so  that  results 
can  be  proved  about  it,  but  one  that  is  also  realistic  enough  so 
that  the  results  provide  some  guidelines,  or  at  least  hints, 
about  reality.  The  model  we  describe  is  attributed  by  Vuillemin 
to  the  three  sources  [4,32,41];  this  and  similar  models  have 
been  used  by  many  others.  There  is  a  fairly  large  literature  on 
models and bounds that we will not attempt to survey completely
here (see, for example, [1,2,4,5,24,26,28,35,39,41,43]).

The basic premise is that there is a minimum feature size
$\lambda$, and minimum delays $\tau_w$ and $\tau_g$, dictated by the technology.
The important assumptions are then that a) no two wires can
have their midpoints closer together than $\lambda$, b) every logical
unit (such as a gate) must have area at least $\lambda^2$, c) passing a
signal through a wire entails a delay of at least $\tau_w$, independent
of the wire's length, and d) passing a signal through a
gate entails a delay of at least $\tau_g$.

As discussed in [2], the assumption that the delay is
independent  of  wire  length  is  true  only  in  certain  regimes. 
Depending  on  the  technology,  the  time  for  propagation  of  sig¬ 
nals  may  be  independent  of  wire  length,  as  we  assume  here 
(the  synchronous  model  [4]),  or  proportional  to  the  logarithm 
of the wire length (using repeaters), or proportional to the
square  of  the  wire  length  (diffusion  case). 

5.3  Lower  Bounds 

The  essential  result  is  expressed  in  a  nicely  general  form 
by  Vuillemin  [43].  He  defines  a  wide  class  of  functions  called 
transitive  functions,  which  includes  integer  product,  convolu¬ 
tion,  linear  transform,  and  matrix  product. 

Theorem  [43]:  Any  circuit  that  computes  a  transitive  function 
has  wire  area 

$A = \Omega(D^2)$ (2)

where D is the data rate in bits/sec.

The period P is related to the data rate D in a simple way:
$P = N/D$, where N is the number of input bits. Therefore the
bound above can also be written

$AP^2 = \Omega(N^2)$ (3)
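
The step from the data-rate bound to the period bound can be spelled out as follows. This is only a restatement of the substitution just described, with constant factors absorbed into the $\Omega$ notation.

```latex
\begin{align*}
D &= \frac{N}{P} && \text{(rearranging } P = N/D\text{)} \\
A &= \Omega(D^2) = \Omega\!\left(\frac{N^2}{P^2}\right) \\
AP^2 &= \Omega(N^2)
\end{align*}
```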

We can also observe that if any one of N input bits can
affect the output, and if there is a constant bound on the
allowed fan-in, then the circuit must have at least $\log N$ stages.
This implies the following lower bound on the latency T:

$T = \Omega(\log N)$ (4)

A good example to illustrate the use of these bounds is B-bit
multiplication. The bounds above tell us that $AP^2 = \Omega(B^2)$
and $T = \Omega(\log B)$. The array multiplier described above has
$A = O(B^2)$, $P = O(1)$, and $T = O(B)$. It is therefore asymptotically
optimal under the measure $AP^2$, but possibly has more delay
than necessary. In fact, the delay can be reduced to the
asymptotically optimal $O(\log B)$ with the area increasing only
from $O(B^2)$ to $O(B^2 \log B)$ [8] (see also [44]).

An interesting measure of goodness, that takes into
account both period and latency, is $AP^2T^2$. By the arguments
above the lower bound on multiplication is then
$AP^2T^2 = \Omega(B^2 \log^2 B)$. The array multiplier mentioned above [8]
has the upper bound $AP^2T^2 = O(B^2 \log^3 B)$, which is therefore no
more than one log-factor away from asymptotic optimality, by
this measure at least.

Compare this result with the bit-serial multiplier M1, for
example. That structure has area $A = O(B)$, period $P = O(B)$, and
latency $T = O(B)$, so that $AP^2T^2 = O(B^5)$. This is an indication that
the  overall  efficiency  of  silicon  utilization  is  not  as  good 
asymptotically  as  that  of  the  parallel  array  multipliers,  but 
one  should  not  conclude  too  much  from  this  argument.  For 
one  thing,  we  may  need  a  multiplier  with  small  area  simply 
because  an  array  multiplier  will  not  fit  on  a  chip,  in  which  case 
we  must  settle  for  the  possible  instead  of  the  asymptotically 
good. It is also quite likely that the constants of proportionality
favor the simple linearly-arranged designs such as M1, so
that  for  reasonably-sized  B  the  measures  above  may  be 
misleading.  The  asymptotic  measures  give  us  useful  guidelines 
for  comparing  designs  that  are  similar,  but  there  are  so  many 
factors  in  choosing  a  multiplier  for  a  particular  application  at 
a  particular  point  in  technological  development,  that 
mathematical  analysis  should  be  interpreted  with  caution. 
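
To make the comparison concrete, the following sketch evaluates the two measures for the three designs discussed in this section. It assumes that every constant of proportionality is 1 and that logarithms are base 2; both are simplifications introduced here, and, as cautioned above, the real constants may well change the ranking for practical word sizes. The function names are illustrative.

```python
import math

def array_multiplier(B):
    """Fully latched array multiplier: A ~ B^2, P ~ 1, T ~ B."""
    return {"A": B**2, "P": 1, "T": B}

def fast_parallel_multiplier(B):
    """Reduced-latency parallel multiplier: A ~ B^2 log B, P ~ 1, T ~ log B."""
    return {"A": B**2 * math.log2(B), "P": 1, "T": math.log2(B)}

def bit_serial_multiplier(B):
    """Bit-serial multiplier: A ~ B, P ~ B, T ~ B."""
    return {"A": B, "P": B, "T": B}

def ap2(design):
    """Area times period squared."""
    return design["A"] * design["P"] ** 2

def ap2t2(design):
    """Area times period squared times latency squared."""
    return ap2(design) * design["T"] ** 2

if __name__ == "__main__":
    designs = [
        ("array", array_multiplier),
        ("fast parallel", fast_parallel_multiplier),
        ("bit-serial", bit_serial_multiplier),
    ]
    for B in (8, 16, 32):
        for name, make in designs:
            d = make(B)
            print(f"B={B:2d} {name:14s} AP^2={ap2(d):10.0f} AP^2T^2={ap2t2(d):12.0f}")
```

With unit constants the bit-serial design scores worst on both measures, which matches the asymptotic argument but says nothing about whether the parallel designs fit on a chip.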

6.  PIPELINING  AND  LATCHING  FOR  THROUGHPUT 

In  the  designs  discussed  above  it  was  assumed  that  every 
signal value was latched (that is, held in a register) at every
stage  in  the  signal  flow  graph.  So,  for  example,  the  array  multi¬ 
plier  MA  has  a  register  after  every  full  adder  (this  was  stressed 
in  [31]).  This  means  that  every  part  of  the  circuit  can  be  used 
for holding intermediate results; that is, every part can function
as  a  pipeline.  This  approach  leads  to  high  throughput  at  the 
expense  of  delay.  In  contrast,  array  multipliers  that  are  com¬ 
mercially  produced  on  packaged  chips  do  not  generally  have  a 
high  degree  of  pipelining  in  this  sense  of  the  term;  the  answer 
is  usually  produced  in  one  or  two  clock  cycles,  and  the  carry 
signals  ripple  through  the  structure,  settling  in  time  for  each 
new clock pulse. In [3], for example, an array multiplier with
combinational  logic  that  is  113  gates  deep  is  mentioned.  Thus, 
commercial  single-package  multipliers  are  optimized  for 
latency and not throughput, and are therefore not necessarily
"fast" for custom chip designs for signal processing applications.

However,  latching  at  every  possible  stage  of  a  circuit  does 
not  necessarily  lead  to  the  highest  throughput.  First,  the 
latches themselves take time to operate; their input stages
must be charged, and they must charge the input stages of the
next layer of logic. Second, the clock driver must drive the
additional capacitance of the latches, and for a given driver
this lengthens the clock rise- and fall-times and decreases the
possible clock rate.

Figure 7. Area, period, and area-period product, as functions
of the number of intermediate latching stages m, in a typical
pipelined array multiplier (after [11]).

If  we  start  with  one  stage  of  a  circuit  that  has  combina¬ 
tional  logic  that  is  many  gates  deep,  and  we  introduce  m  inter¬ 
mediate  stages  of  latching,  we  decrease  the  period  by  a  factor 
of roughly 1/m, up to the point that the latching and clock-driving
time becomes comparable to the propagation time of
signals through one stage of logic. After that point there are
diminishing returns to the addition of more latching. In [11]
the  generic  situation  of  a  block  of  combinational  logic  is 
modeled  mathematically,  and  the  optimal  choice  of  the 
number  of  additional  latches,  m,  was  studied.  Figure  7  shows 
typical  curves  of  area,  period,  and  area-period  product  as  a 
function of m for a circuit with a depth of 100 gates. As can be
seen, the period as a function of m decreases sharply to a
minimum and stays almost constant, the area increases
steadily with m, and the product has a well-defined minimum.
We may wish to minimize this product AP instead of the period
P; 1/AP can be written as (1/P)/A, the throughput per unit
area. In any case, the optimal values of m for minimizing P or
AP  will  be  close  to  each  other. 
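
The shape of these curves can be reproduced with a deliberately simple model. The sketch below is not the model of [11]. It assumes that the logic delay divides evenly among m + 1 stages, that each latch adds a fixed delay, that clock loading adds a delay growing linearly with m, and that each latching stage adds a fixed fraction of area. All parameter values and names are hypothetical.

```python
def period(m, logic_delay=200.0, latch_delay=5.0, clock_penalty=0.5):
    """Clock period (ns) with m intermediate latching stages.

    logic_delay   : total combinational delay, e.g. 100 gates at 2 ns (assumed)
    latch_delay   : fixed delay added by one latch (assumed)
    clock_penalty : extra clock rise/fall time per added stage (assumed)
    """
    return logic_delay / (m + 1) + latch_delay + clock_penalty * m

def area(m, base_area=1.0, area_per_stage=0.01):
    """Relative area; each latching stage adds a fixed fraction (assumed)."""
    return base_area * (1.0 + area_per_stage * m)

def best_m(cost, max_m=100):
    """Integer m in [0, max_m] that minimizes the given cost function."""
    return min(range(max_m + 1), key=cost)

m_for_period = best_m(period)
m_for_area_period = best_m(lambda m: area(m) * period(m))

if __name__ == "__main__":
    print("m minimizing P :", m_for_period)
    print("m minimizing AP:", m_for_area_period)
```

With these made-up numbers the period is smallest at m = 19 and the area-period product at m = 16; the period curve is nearly flat around its minimum, so the two choices cost almost the same in speed.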

A typical example of such a situation occurs in the implementation
of the bit-parallel array multiplier MA discussed in
Section 3.4. Here the analysis predicts that the period of a
16-bit array multiplier can be decreased from 210 nsec to 66
nsec by the addition of 5 stages of intermediate latches, with
an attendant increase in area of only 13%.


7. CONCLUSIONS

The  design  and  development  of  custom  chips  for  signal 
processing  tasks  is  very  challenging,  calling  as  it  does  on  the 
signal  processing  expert  to  make  decisions  at  many  design  lev¬ 
els:  he  must  manage  overall  system  architecture,  circuit 
topology,  timing,  area  utilization,  and  layout.  At  the  same 
time, making good use of such resources can lead to reliable
low-cost  devices  that  have  very  high  throughput  in  many  signal 
processing  applications. 

The  key  to  effective  design  is  a  high  degree  of  pipelining 
using regular, repetitive structures and fixed data-flow paths.
Such  structures  can  be  hierarchically  organized,  making  the 
design  and  layout  problems  manageable. 

ABSTRACT


This paper argues the case for a computer with massive
amounts  of  primary  storage,  on  the  order  of  billions  of  bytes.  We 
argue  that  such  a  machine,  even  with  a  relatively  slow  processor, 
can  outperform  all  other  supercomputers  on  memory  bound  compu¬ 
tations.  This  machine  would  be  simple  to  program.  In  addition,  it 
could  lead  to  new  and  highly  efficient  programs  which  traded  the 
available  space  for  running  time.  We  present  a  novel  architecture 
for  such  a  machine,  and  show  how  it  can  lead  to  reduced  memory 
access  times. 


Note:  An  extended  version  of  this  paper  has  been  submitted  to  the 

1. INTRODUCTION.

This  paper  argues  the  case  for  a  computer  with  a  primary  memory  substan¬ 
tially  larger  than  what  is  currently  (or  will  be  in  the  near  future)  available  on  a 
single  machine.  We  do  not  have  a  specific  target  size  for  such  a  massive  memory 
machine (MMM), but for argument's sake let us say we want a few billion bytes of
main physical memory. This size is certainly larger than what any manufacturer
offers today, or will probably offer in the near future. Our thesis is that such a
MMM is justified, even today, by the importance of certain applications in which
memory bound computations occur naturally. For these computations, a classic
von Neumann machine with a relatively slow processor and massive amounts of
physical memory would vastly outperform even the “supercomputers” currently
being  researched  and  would  be,  in  addition,  far  easier  to  program. 

In  Section  2  we  present  the  case  for  a  MMM,  including  its  economic  feasibil¬ 
ity.  In  Section  3  we  then  discuss  how  an  efficient  MMM  could  be  built. 

2.  THE  CASE  FOR  A  MMM. 

Research  efforts  in  the  supercomputer  field  have  tended  to  concentrate  at  the 
computationally intensive end of the spectrum, disregarding the memory intensive
applications altogether. The typical supercomputer being investigated today is a
multiprocessor having up to one million processors, capable of executing up to
billions of operations per second, and yet having as “little” as sixty-four megabytes
of  physical  memory  [Comp80,  Comp81,  Comp82,  Evan82]. 

There are many applications for which such a machine (as well as any conventional
machine) would be limited by its disk to memory transfer rates. For
example, consider a program which accesses a four gigabyte ($4 \times 10^9$ bytes) data
structure with an essentially random pattern. A machine with one hundred
megabytes of memory or less can be expected to generate a page fault in just about
every  memory  access,  rendering  its  potential  processing  power  meaningless  as  a 
measure  of  its  performance. 

More precisely, let us compare such a supercomputer with one hundred megabytes
of memory and a MMM with four gigabytes of memory. Further, let us
assume that the supercomputer is “infinitely fast” while the MMM runs only at
one MIPS (Million Instructions per Second). Of course the supercomputer will
vastly outperform the MMM on compute bound tasks. However, for the
memory bound program we are discussing, assume that the supercomputer
creates a page fault every f instructions, and that its disks are capable of servicing
100 page faults a second. Then on this task the MMM still computes at its
one MIPS rate while the supercomputer is reduced to computing at about 100f
instructions a second. Clearly if f is small enough the MMM will be faster than
the supercomputer: if f is about 100 then the speedup is 100:1! While not all
tasks will cause the supercomputer to “thrash” in this way, we believe that there
is a large collection of important tasks that will cause such behavior.
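
The arithmetic behind this comparison is easy to check. The sketch below assumes, as in the text, a 1 MIPS MMM and a supercomputer limited only by disks that service 100 page faults per second; the function and parameter names are ours.

```python
def supercomputer_rate(f, faults_per_second=100):
    """Instructions per second when a page fault occurs every f instructions
    and the processor itself is taken to be infinitely fast."""
    return f * faults_per_second

def mmm_speedup(f, mmm_rate=1_000_000, faults_per_second=100):
    """Ratio of the MMM's instruction rate to the paging-bound supercomputer's."""
    return mmm_rate / supercomputer_rate(f, faults_per_second)

if __name__ == "__main__":
    for f in (10, 100, 1_000, 10_000):
        print(f"fault every {f:6d} instructions -> MMM speedup {mmm_speedup(f):8.1f}")
```

Under these assumptions the two machines break even when a fault occurs once every 10,000 instructions; any higher fault frequency favors the MMM.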


2.1  Applications. 

An  MMM  will  produce  significant  improvements  for  any  task  which  refer¬ 
ences,  in  a  relatively  random  fashion,  a  large  address  space.  Here  we  will  review 
three  areas  in  which  such  tasks  abound,  but  this  list  is  by  no  means  exhaustive. 

(a)  Databases  [Date81,  Wied77].  It  is  well  known  that  many  database  applica¬ 
tions  are  IO  bound,  that  is,  limited  by  the  speed  at  which  data  can  be 
transferred  from  disks.  Clearly,  if  the  entire  database  (or  a  substantial  frac¬ 
tion)  could  reside  in  main  memory,  then  the  IO  bottleneck  would  be  elim¬ 
inated. 

Not only will existing queries be answered faster, but it will now be possible
to  pose  new  interesting  queries  that  previously  required  unreasonable  times 
to  answer.  Thus,  users  can  get  more  useful  information  out  of  the  system. 

(Reliability  may  be  a  problem  in  a  massive  memory  database.  We  will 
return  to  this  issue  later.) 

(b) VLSI Design [Mead80]. The size of VLSI circuits being designed is growing
at a fast rate. Today there are circuits with a half million transistors, and
predictions of integrated circuits with as many as one hundred million
transistors by the mid 90's. VLSI design tools will perforce deal with massive
amounts of data, notwithstanding much cleverness in the use of
hierarchical  design  and  the  encoding  of  information. 

Many  of  the  VLSI  design  algorithms  have  good  asymptotic  running  times, 
but have very poor locality of reference. Thus, they are natural candidates
for  an  MMM.  For  example,  a  layout  system  we  have  designed  [Lipt82]  uses 
topological  sorting  for  placing  objects.  The  algorithm  for  sorting  requires 
linear  time,  but  unfortunately  also  requires  linear  space  and  has  almost  no 
locality. Thus, beyond a certain layout size, its actual running time is determined
by the memory available: at a given point, increasing the layout size
by 30% sends our computer into uncontrolled thrashing and increases the
running time tenfold!

(c)  Artificial  Intelligence  [Nils80,  Wins77].  The  concept  of  vast  data  struc¬ 
tures  built  mainly  by  the  use  of  pointers,  and  hence  lacking  any  locality  of 
reference when accessed, brings the words “Lisp” and artificial intelligence
(AI) to mind immediately. Garbage collection [Cohe81] and paging times
contribute  substantial  fractions  to  the  total  running  times  of  many  AI  pro¬ 
grams.  It  seems  fair  to  say  that  a  good  fraction  of  AI  research  involves 
memory  bound  computations. 

Certain  AI  programs,  such  as  DENDRAL  [Buch78]  or  MACSYMA  [Mart71], 
have  succinct  inputs  and  generally  produce  succinct  outputs,  and  yet  may 
build  enormous  intermediate  data  structures.  These  programs  are  even 
better suited to a MMM than others. They would not even need to incur
the overhead of loading the massive memory as a data base or VLSI program

2.2 The Economic Feasibility of a MMM.

Clearly  VLSI  has  made  computing  in  general  cheaper.  It  is  also  clear, 
although  not  as  well  understood  by  everybody,  that  VLSI  has  made  certain  kinds 
of  computing  cheaper  than  others.  One  example  of  this  differential  impact 
involves  memory  and  processing  power:  over  the  past  few  years,  the  price  of  logic 
circuits has decreased about 20% per year; during that same span, memory prices
have decreased at twice that rate: almost 40% per year. Clearly that trend, if
continued,  should  be  very  good  news  indeed  for  applications  that  require  memory 
bound  computations. 

In  fact,  there  are  good  reasons  to  believe  that  the  figures  given  in  the  previ¬ 
ous  paragraph  represent  more  than  a  local  kink  in  the  prices  of  these  commodi¬ 
ties,  brought  about  by  a  vicious  fight  for  market  share  in  a  particularly  impor¬ 
tant  market.  Memories  are  the  most  regular  integrated  circuits  (ICs),  and  thus 
among  those  which  would  profit  immediately  from  higher  fabrication  densities. 
We believe that memories will always be the first circuits to profit from progress
in integrated circuit manufacturing technology.

At  today’s  prices,  the  cost  of  the  ICs  necessary  to  build  a  one  gigabyte 
memory  is  below  one  million  dollars.  This  is  not  out  of  proportion  with  the 
investment necessary to equip a state-of-the-art installation for research or production
work in some of the areas identified earlier. Furthermore, if the price
trends hold, the ICs necessary to build a four gigabyte memory would cost
approximately 200,000 dollars by the end of the present decade.
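
These figures are consistent with one another, as a rough projection shows. The sketch below assumes a starting price of one million dollars per gigabyte (the upper bound quoted above) and a steady 40% annual decline; both are simplifications, and the function names are illustrative.

```python
import math

def projected_cost(cost_now, annual_decline, years):
    """Cost after the given number of years of a constant fractional decline."""
    return cost_now * (1.0 - annual_decline) ** years

def years_to_reach(cost_now, target_cost, annual_decline):
    """Years of constant decline needed for cost_now to fall to target_cost."""
    return math.log(target_cost / cost_now) / math.log(1.0 - annual_decline)

# Assumed starting point: four gigabytes at one million dollars per gigabyte.
FOUR_GB_TODAY = 4 * 1_000_000

if __name__ == "__main__":
    years = years_to_reach(FOUR_GB_TODAY, 200_000, 0.40)
    print(f"years until 4 GB costs $200,000: {years:.1f}")
    print(f"cost after 6 years: ${projected_cost(FOUR_GB_TODAY, 0.40, 6):,.0f}")
```

Under these assumptions the cost of a four gigabyte memory falls to about 200,000 dollars in a little under six years.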


2.3  New  Programming  Techniques. 

A MMM is straightforward to program. Existing programs can be run on it,
and if they are memory intensive, they will run very fast. However, the impact
of a MMM may be even more far reaching. A MMM may alter the way we program,
and this in turn may yield even greater improvements [Gray83, Wein83].

For example, consider the concurrency control mechanism of a database system.
Since user programs (called transactions) encounter long delays as they wait
for  disk  pages  to  be  brought  into  main  memory,  the  database  system  executes 
several  transactions  concurrently.  Since  the  transactions  are  not  independent 
(they are reading and writing the same database), their actions cannot be interleaved
in arbitrary ways. The concurrency control mechanism (typically using
locking) ensures that only interleavings that preserve data consistency are run.
Very roughly, about 10% of the CPU instructions are spent doing concurrency
control. 

When  the  database  system  is  transferred  to  a  MMM,  the  disk  delays  disap¬ 
pear,  and  concurrency  control  may  no  longer  be  needed.  The  data  required  by 
each  transaction  is  already  in  memory,  so  if  transactions  are  short  (as  they  are  in 
many  commercial  systems)  they  can  simply  be  scheduled  sequentially.  So  in 
addition  to  making  data  available  faster,  a  MMM  may  eliminate  the  overhead  of 
concurrency  control. 

In  general,  having  massive  amounts  of  memory  will  change  our  programming 
techniques. Data structures for secondary storage (e.g., B-trees, extendible hashing)
will become obsolete. Table lookup will be practical in many more cases.
For  instance,  instead  of  computing  trigonometric  functions  with  a  series,  we  may 
want to have a large table of values and use simple interpolation. Digital searching
[Knut73], which improves search times at the expense of memory space, will
be  commonplace. 
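
As an illustration of the table lookup idea, the sketch below precomputes sine values and answers queries by linear interpolation. The table size and the names are assumptions chosen for the example; a machine with massive memory could afford a far larger table and correspondingly smaller error.

```python
import math

TABLE_SIZE = 4096                      # intervals covering one period (assumed)
STEP = 2.0 * math.pi / TABLE_SIZE
# One extra entry so that index i + 1 is always valid.
SINE_TABLE = [math.sin(i * STEP) for i in range(TABLE_SIZE + 1)]

def table_sin(x):
    """Approximate sin(x) by linear interpolation between table entries."""
    position = (x % (2.0 * math.pi)) / STEP   # reduce to one period, then scale
    i = min(int(position), TABLE_SIZE - 1)    # clamp guards against rounding at the top end
    fraction = position - i
    return SINE_TABLE[i] + fraction * (SINE_TABLE[i + 1] - SINE_TABLE[i])

if __name__ == "__main__":
    for x in (0.5, 1.0, 2.5):
        print(f"x={x}: table {table_sin(x):.7f}  series {math.sin(x):.7f}")
```

For linear interpolation the error is bounded by the square of the step size divided by 8, so each doubling of the table cuts the error by a factor of four; this is the space-for-time trade the paragraph describes.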


3.  THE  ESP  ARCHITECTURE. 

We  have  argued  that  main  memory  is  a  useful  resource  in  many  applica¬ 
tions,  and  that  a  computer  with  massive  amounts  of  memory  (e.g.,  gigabytes)  is 
economically  feasible. 

But  are  there  any  technological  challenges  in  building  a  MMM?  Is  it  not  just 
a  matter  of  connecting  all  the  desired  memory  to  the  chosen  processor  in  a  con¬ 
ventional  way,  i.e.,  with  a  very  long  bus?  (See  Figure  1.) 

Fig. 1: A Conventional Architecture MMM

A  conventional  architecture  is  a  reasonable  one,  but  as  we  will  discuss 
shortly  there  are  other  architectures  that  may  be  superior.  The  conventional 
architecture has two main weaknesses: memory access times and reliability.

•  Memory  access  times.  Given  current  IC  densities,  a  four  gigabyte 
memory  requires  about  one  thousand  devices  (memory  cards)  on  a  single  bus. 
Even  with  clever  arrangements  and  higher  densities,  hundreds  of  devices  per 
bus  seem  unavoidable.  Building  a  special  purpose  bus  to  support  that  many 
devices  is  feasible,  although  not  trivial.  However,  regardless  of  how  the  bus 
is  implemented,  as  the  size  of  the  memory  grows,  the  access  times  grow 
because  of  the  physical  distances  and/or  capacitance  effects.  At  the  same 
time,  memories  are  becoming  faster,  so  that  the  larger  access  times  make  us 
lose  part  of  the  advantage  of  having  a  massive  memory. 

•  Reliability.  As  the  size  of  the  memory  grows,  the  probability  that  one  of 
its  components  fails  also  grows.  A  conventional  architecture  has  no  provi¬ 
sion  for  graceful  degradation,  and  hence  the  entire  machine  would  be  una¬ 
vailable  with  unacceptably  high  probability.  For  database  applications, 
some  type  of  memory  redundancy  is  also  necessary  in  order  to  avoid  loss  of 
data. 

In  the  rest  of  this  section  we  present  a  new  architecture  which  directly 
addresses the first of these weaknesses. The reliability issues are briefly discussed
in the conclusions section, and in a separate, more detailed report [Garc83b].

3.1.  A  Novel  Architecture. 

Our  basic  premise  is  that  the  time  to  access  memory  over  a  long  bus  (i.e., 
one  that  drives  hundreds  of  devices)  is  substantially  larger  than  the  access  time 
over  a  short  bus  (i.e.,  one  driving  a  single  memory  board).  The  meaning  of  “sub¬ 
stantially”  depends  on  how  the  buses  are  implemented,  but  for  the  time  being  let 
us  assume  that  access  times  over  a  long  bus  are  at  least  an  order  of  magnitude 
larger  than  over  a  short  bus. 

A  classical  solution  for  improving  access  times  over  a  long  bus  is  to  add  a 
memory  cache  [Kapl73,  Smit82]  to  the  processor.  (See  Figure  2.)  The  idea  is  that 
commonly  accessed  data  reside  in  the  cache,  and  are  hence  available  with  smaller 
delays  (both  because  the  cache  bus  is  shorter  and  because  the  cache  memory  is 
generally  faster).  Unfortunately,  caching  does  not  improve  access  times 
significantly  for  the  programs  we  have  in  mind.  A  cache  may  be  useful  for  hold¬ 
ing  some  commonly  accessed  values,  but  as  discussed  in  Section  2,  we  are  con¬ 
cerned  with  programs  that  reference  their  data  structures  in  essentially  random 
ways.  Thus,  for  most  of  the  recently  referenced  data,  the  probability  of  being 
accessed  next  is  low. 
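
A back-of-the-envelope model shows why. If references are spread uniformly at random over the data structure (an idealization of the "essentially random" pattern), the hit ratio is roughly the fraction of the data that fits in the cache. The cache size and the 10:1 ratio of access times used below are assumed values for illustration, and the function names are ours.

```python
def hit_ratio(cache_bytes, data_bytes):
    """Probability that a uniformly random reference is already cached."""
    return min(1.0, cache_bytes / data_bytes)

def effective_access_time(cache_bytes, data_bytes,
                          short_bus_time=1.0, long_bus_time=10.0):
    """Average access time: hits cost short_bus_time, misses cost long_bus_time."""
    h = hit_ratio(cache_bytes, data_bytes)
    return h * short_bus_time + (1.0 - h) * long_bus_time

if __name__ == "__main__":
    data = 4 * 2**30                   # four gigabytes of data (binary units)
    for cache in (2**16, 2**20, 2**24):
        t = effective_access_time(cache, data)
        print(f"cache {cache:>9d} bytes -> average access time {t:.4f}")
```

Even a sixteen megabyte cache leaves the average access time within half a percent of the long-bus time in this model.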


Fig.  2:  A  MMM  with  a  Cache 

If we cannot bring the data to the processor as fast as we would like, we
could instead “take the processor to the data”. This is precisely what the ESP
MMM  does.  A  schematic  description  of  it  is  shown  in  Figure  3.  (The  name  ESP 
will  be  explained  shortly.) 


Fig.  3:  The  ESP  MMM 

The ESP MMM consists of a collection of standard von Neumann machines,
interconnected  by  a  system-wide  (or  global)  bus  that  permits  the  broadcast  of 
values  from  one  machine  to  all  the  others.  Each  individual  machine  has  its  own 
processor  and  local  memory  connected  via  a  local  (short)  bus.  The  gateway  of 
each  machine  to  the  globr.1  bus  is  an  ESP  device  connected  both  to  the  system 
bus  and  tbe  local  bus.  (The  number  of  machines  is  not  critical  to  the  architec¬ 
ture,  but  we  expect  a  system  with  a  few  gigabytes  to  have  a  relatively  small 
number  of  machines,  possibly  up  to  one  hundred.  This  means  that  each  indivi¬ 
dual  machine  has  a  substantial  amount  of  memory.) 
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The  individual  processors  share  the  same  address  space.  This  address  space 
is  distributed  among  the  local  address  spaces  as  follows  (see  Figure  3).  A  small 
fraction  of  the  global  address  space  is  replicated  in  each  local  address  space;  the 
remainder  of  the  system  address  space  is  covered  in  a  non-overlapping  manner  by 
the  local  address  spaces.  An  ESP  device  connected  to  each  local  bus  is  responsi¬ 
ble  for  servicing  requests  that  involve  non-local  addresses. 

Even  though  the  ESP  MMM  has  multiple  processors,  it  is  a  single  instruction 
stream,  single  data  stream  machine  (SISD)  [Flyn72].  All  processors  execute  the 
same program, which is loaded into the replicated portion of the system address
space. As long as that program references locations in the shared subspace, all
processors  will  execute  in  lockstep  and  no  communication  through  the  system  bus 
will  take  place.  References  outside  the  shared  address  space  are  broadcast  and 
received  on  the  global  bus,  as  is  illustrated  by  the  following  example. 

Consider a program which references memory words w1 through w9. Assume
that w5, w6, w7 are in machine 2, and the rest of the words in machine 3. Figure
4 shows the time at which each processor receives a referenced word. In this
figure we assume that fetching a word from local memory takes one time unit,
and that broadcasting a word over the system bus takes two units. (We choose
two units only to simplify the example. As discussed earlier, we expect the
system delays to be orders of magnitude larger than the local ones.)

At time 0, all processors start; since they all run the same program, they all
request word w1. Processor 3 has w1 locally, so one time unit later it receives it.
From then on, processor 3 works at full speed, accessing words w2, w3, and w4.
At time 4, processor 3 requests word w5, but since it is not local, a delay ensues.

In the meantime, the ESP at machine 3 has been broadcasting words w1
through w4. Word w1 arrives at processors 1 and 2 at time 3, and the following
words arrive at one unit intervals. Note that the words are “pipelined” on the
bus, so that there is only one system bus end-to-end delay involved. Hence, after
the initial delay, processors 1 and 2 start receiving and processing the words at
full speed.

During this time we say that processor 3 “has the lead”, i.e., is ahead of the
others. But when processor 2 references w5, it finds this word in its local memory
and takes the lead. The other processors must now wait until the ESP at
machine 2 broadcasts w5 and the following words. In a similar fashion, the lead
changes back to processor 3 when w8 is referenced.

Fig. 4: Execution in an ESP MMM


In summary, an ESP examines each word request made by its local processor.
If the address refers to the shared subspace, the ESP does nothing. If it
refers to the local non-replicated memory, then the ESP reads the fetched word
off the local bus and broadcasts it over the system bus. In case of a reference to
remote memory, the ESP waits for the next word broadcast over the system bus,
and then places it on the local bus. (This is why we picked the name “ESP” for
these controllers: the remote words required appear on the system bus without
having been requested, as if the controllers had ExtraSensory Perception.) In any
case, the processor is not aware of the ESP controller (except for time delays); it
operates as if it had a long bus linking it to all the memory units. Each local
memory module must know the addresses of the data it holds, honor requests for
its data, and ignore all other requests. (This is how memory modules in a
conventional architecture operate.)
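
The three cases handled by an ESP can be sketched in a few lines of Python. This is only an illustration under assumed conditions: the address layout (a replicated region at the bottom of the address space, followed by equal contiguous slices, one per machine), the constants REPLICATED_WORDS and WORDS_PER_MACHINE, and the names owner and esp_action are all hypothetical and do not come from the ESP design itself.

```python
from enum import Enum


class Action(Enum):
    IGNORE = "ignore"        # replicated address: every processor has its own copy
    BROADCAST = "broadcast"  # local non-replicated address: send the word to all
    RECEIVE = "receive"      # remote address: wait for the next word on the bus


# Hypothetical layout (an assumption, not taken from the text): addresses below
# REPLICATED_WORDS are replicated in every machine; the remaining addresses are
# split into equal contiguous slices, one slice per machine.
REPLICATED_WORDS = 1024
WORDS_PER_MACHINE = 4096


def owner(address):
    """Return the index of the machine holding a non-replicated address."""
    return (address - REPLICATED_WORDS) // WORDS_PER_MACHINE


def esp_action(machine, address):
    """Decide what the ESP at `machine` does for a read of `address`."""
    if address < REPLICATED_WORDS:
        return Action.IGNORE
    if owner(address) == machine:
        return Action.BROADCAST
    return Action.RECEIVE
```

For instance, under this assumed layout esp_action(0, 1024) is a broadcast at machine 0, while the same address at machine 1 is a receive.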

While the common program generates requests for data local to machine m,
the processor at m takes the lead. All other processors continue execution at the
same rate as m, with their ESPs supplying the data they need. These “trailing”
processors will be behind the leader by an amount of time equal to the one-way
delay time between ESPs through the system bus. When a reference to an
address local to another machine occurs, that machine takes the lead.
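
The leading and trailing behaviour can be traced with a small simulation. The sketch below is a simplified model, not a reproduction of Figure 4: it assumes that the owner of a word receives it `local` time units after it has received the previous word, and that every other machine sees the word `bus` units later. The function name esp_timeline and the list of owners (the first eight words of the example above) are illustrative choices.

```python
def esp_timeline(owners, n_machines, local=1, bus=2):
    """Return arrival[m][i], the time at which machine m receives word i.

    owners[i] is the machine that holds word i. Assumed model: the owner gets
    a word `local` units after it received the previous one, and all other
    machines get it `bus` units after the owner (words are pipelined).
    """
    arrival = {m: [] for m in range(1, n_machines + 1)}
    for i, lead in enumerate(owners):
        # The leader can request word i only once it has word i-1.
        ready = arrival[lead][i - 1] if i > 0 else 0
        t_lead = ready + local
        for m in arrival:
            arrival[m].append(t_lead if m == lead else t_lead + bus)
    return arrival


# w1..w4 in machine 3, w5..w7 in machine 2, w8 in machine 3.
owners = [3, 3, 3, 3, 2, 2, 2, 3]
times = esp_timeline(owners, n_machines=3)
```

In this model processor 3 receives w1 through w4 at times 1 to 4, and processors 1 and 2 receive them at times 3 to 6, which agrees with the description above. The times after the first lead change follow from the model's assumptions only.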

Writes  to  memory  can  be  ignored  by  the  ESPs.  When  the  program  calls  for 
storing into the replicated address space, all processors will execute the instruction
and will update their copies. When the program modifies non-replicated
storage,  the  processor  with  the  data  will  modify  it,  and  the  rest  need  do  nothing. 
(When  we  discuss  reliability  in  Section  4,  we  will  see  that  special  precautions 
must  be  taken  when  writing  into  the  non-replicated  address  space.) 

The  replicated  address  space  is  used  to  store  the  program  and  commonly 
accessed  values.  In  addition,  each  processor  may  have  registers  and  a  cache  to 
hold  recently  accessed  data. 

Two important things to note about the system bus are that it acts as the
system “clock” and that there is no contention. The data transmitted over the
bus are the timing signals that keep all processors in synchrony. (In the example
of Figure 4, processor 2 picks up the lead when it receives word w4 from processor
3.) Since non-replicated data is found only at a single machine, only one ESP will
ever  broadcast  at  a  time.  This  means  that  the  bus  protocols  will  be  very  simple, 
and  hence  transmissions  can  be  fast. 

The  ESP  architecture  has  the  following  advantages  over  a  conventional  one: 

(1)  The  local  machines  have  conventional  architectures.  They  may  be  used 
independently when the MMM is not needed.

(2) For fully random references, memory access times are cut by roughly a factor
of two. In a conventional machine, the address must be transmitted on
the system bus and the referenced datum must be transmitted back. In an
ESP machine, no addresses have to be transmitted on the global bus: each
datum appears on the system bus without having been requested. That is,
since references are random, each memory access will cause a lead change.
But these lead changes only involve a one-way broadcast, and thus, half the
delay encountered in a conventional architecture.

(3) The ESP MMM will reward “locality of reference” by minimizing “lead
changes” in programs that exhibit it. That is, if two or more references fall
within the same memory module, then the access times are reduced to local
bus times. The fewer the lead changes, the faster the ESP MMM will execute.

Locality in this context, however, has a wider meaning than in a conventional
memory cache or virtual storage system. Here, locality of reference
means that two references are local to the lead machine, and this machine
may have a substantial chunk of memory (probably tens of megabytes). In
the next sub-section we will explore these issues in more detail.
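
Advantages (2) and (3) can be made concrete with a rough cost model. The sketch below is a back-of-the-envelope illustration under stated assumptions, not a result from the analysis: every ESP reference is charged one local access d, each lead change adds one one-way bus delay, and a conventional machine without a cache is charged two one-way delays per reference (address out, datum back). The function names are hypothetical.

```python
def lead_changes(owners):
    """Count how often consecutive references fall in different machines."""
    return sum(1 for prev, cur in zip(owners, owners[1:]) if prev != cur)


def esp_time(owners, d, one_way):
    """Assumed ESP cost: a local fetch per reference plus one bus delay per lead change."""
    return len(owners) * d + lead_changes(owners) * one_way


def conventional_time(owners, one_way):
    """Assumed conventional cost: address out and datum back for every reference."""
    return len(owners) * 2 * one_way


clustered = [3, 3, 3, 3, 2, 2, 2, 3]   # few lead changes
scattered = [1, 2, 3, 1, 2, 3, 1, 2]   # a lead change on every reference
```

With d = 1 and a one-way delay of 100, the scattered trace costs 708 units on the ESP model against 1600 on the conventional one (roughly the factor of two), while the clustered trace costs only 208 units.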

What is the price we pay for these advantages? Obviously, we have replicated
processors and some data. Given current pricing trends, the cost of this
extra  hardware  should  be  reasonable,  at  least  compared  to  the  cost  of  the  massive 
memory.  What  we  have  not  sacrificed  is  simplicity  and  ease  of  programming. 
The  processors  and  memory  modules  are  conventional.  The  ESP  architecture  is 
transparent  to  the  user  program.  The  task  of  distributing  the  global  address 
space  to  the  spaces  of  the  individual  machines  can  be  relegated  to  a  sophisticated 
loader. 

3.2.  Program  Locality. 

The potential performance improvements of an ESP MMM over one with a
conventional architecture hinge on two main factors:

(i) The “locality” exhibited by the program, and

(ii) The memory access times over the system and local buses.

In this sub-section we study the first factor in more detail. The bus times are
discussed in the following sub-section.

The ESP MMM utilizes several mechanisms to improve memory access
times: (1) registers and caches at each processor to hold recently accessed values;
(2) a replicated address space to hold the program and commonly accessed values;
and (3) the ESP mechanism, which lets the leading or controlling processor move
to the memory module where the data resides. The first two mechanisms can be
easily incorporated into a conventional MMM, so the decisive factor is clearly the
ESP mechanism.

What  does  the  ESP  mechanism  give  us  that  the  others  do  not?  In  order  to 
answer  this  question,  let  us  postulate  a  simple  data  reference  pattern.  (We  are 
not  interested  in  the  instruction  reference  pattern,  since  the  entire  program  is 
replicated  in  all  machines.) 

Suppose  that  the  M  memory  words  of  the  MMM  are  divided  into  blocks  of  B 
words  each.  A  block  is  the  unit  of  data  transfer  between  the  memory  and  a 
cache.  We  assume  that  the  location  of  the  next  referenced  block  depends  only  on 
the  location  of  the  most  recently  accessed  one.  Specifically,  Figure  5  gives  the 
probability  distribution  of  the  next  reference.  There  is  a  set  of  a  blocks,  centered 
on  the  last  referenced  block,  that  have  a  high  probability  p  of  being  accessed 
next.  All  other  blocks  have  a  much  lower  probability  q.  (For  simplicity,  we 
assume  that  when  the  last  reference  is  within  a/2  blocks  of  the  ends  of  the 
memory,  the  distribution  wraps  around.)  We  assume  that  a  is  odd. 
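
The reference pattern just described can be written down as a small sampler. This is a minimal sketch under assumptions: the number of blocks is taken to be M/B, the window includes the last referenced block itself, and q is obtained from the normalization a*p + (n_blocks - a)*q = 1. The function names low_probability and next_block are illustrative.

```python
import random


def low_probability(n_blocks, a, p):
    """Solve a*p + (n_blocks - a)*q = 1 for q, the probability of each far block."""
    if not 0 < a < n_blocks:
        raise ValueError("the window must be smaller than the memory")
    q = (1.0 - a * p) / (n_blocks - a)
    if q < 0 or q > p:
        raise ValueError("p and a do not give a valid two-level distribution")
    return q


def next_block(last, n_blocks, a, p, rng=random):
    """Sample the next referenced block given the last one (with wraparound)."""
    half = (a - 1) // 2                  # a is odd: window is last-half .. last+half
    if rng.random() < a * p:             # total probability mass of the window
        offset = rng.randint(-half, half)
    else:
        # uniform over the n_blocks - a blocks outside the window
        offset = rng.randint(half + 1, n_blocks - half - 1)
    return (last + offset) % n_blocks
```

For example, with 100 blocks, a = 5 and p = 0.1, half of the probability mass lies in the window and low_probability returns q = 0.5/95.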


Fig. 5: The Probability Distribution.

Our experience tells us that this is, in an idealized way, the way programs
reference their data (e.g., see [Siss68, Smit82]). For example, consider a program
that simulates a VLSI chip. When a transistor is referenced, several contiguous
words may be referenced. The next transistor reference is likely to be to a
connected one, and if the circuit is represented in a reasonable way, it will be close
to the previous one. Here “close” may mean within a few thousand bytes, so our
high probability window, a, may be relatively large.

The  parameters  a  and  p  define  the  locality  of  the  program.  As  a  shrinks 
and/or  p  grows,  the  program  exhibits  more  locality,  and  as  a  grows  and/or  p 
approaches  q,  the  references  become  more  random  (i.e.,  the  distribution  becomes 
flatter). 

Note  that  this  distribution  ignores  other  types  of  data  locality  that  may  also 
be  exhibited  by  programs.  For  instance,  programs  may  have  time  locality  (i.e., 
tend  to  reference  recently  accessed  data)  or  may  access  certain  fixed  locations 
with  high  probability.  Since  these  types  of  localities  are  exploited  by  data 
caches,  the  distribution  we  have  selected  to  study  will  highlight  the  strengths  of 
the  ESP  mechanism,  not  of  caches.  This  is  precisely  what  we  want  to  do. 

Using this probability distribution, we have analyzed the performance of an
ESP mechanism (where processors have no registers or caches) and of a simple
cache. The analysis is described in [Garc83]. Figure 6 presents some typical
results. The figure shows the hit ratio for the cache (h_c) and the ESP mechanism
(h_e), as a function of a, the high probability window. For the cache, the hit ratio
is the probability that the next referenced word is in the cache. For the ESP, it
is the probability that the next word falls in the same machine as the previous
word. (In the figure, locality decreases from left to right.)
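
The ESP hit ratio defined above can be computed directly for the two-level distribution. The sketch below is not the analysis of [Garc83]; it is an illustrative calculation that assumes each machine holds an equal contiguous run of blocks and that the last reference is equally likely to be any block of a machine. The function name esp_hit_ratio and its parameters are hypothetical.

```python
def esp_hit_ratio(n_blocks, blocks_per_machine, a, p):
    """Probability that the next block lies in the same machine as the last one.

    Assumes equal contiguous machines and a last reference that is uniformly
    distributed over the blocks of a machine.
    """
    q = (1.0 - a * p) / (n_blocks - a)   # probability of each block outside the window
    half = (a - 1) // 2
    total = 0.0
    for last in range(blocks_per_machine):           # blocks of machine 0
        # window blocks (with wraparound) that still fall inside machine 0
        in_window = sum(
            1 for k in range(-half, half + 1)
            if (last + k) % n_blocks < blocks_per_machine
        )
        total += in_window * p + (blocks_per_machine - in_window) * q
    return total / blocks_per_machine
```

Two sanity checks follow from the definition: with a = 1 and p = 1 every reference repeats the last block, so the ratio is 1, and with p equal to 1/n_blocks the references are uniform, so the ratio is the fraction of memory held by one machine.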

If on each memory reference the cache can fetch a significant portion of the
“high probability of next access” window (that is, if a is close to 1 block), then
the cache performs very well. This happens when either the program has very high
locality or the system bus feeding the cache is very wide. In this case the ESP
does not have any advantages over the cache.

At  the  other  extreme  (very  large  a),  references  are  fully  random  and  both 
mechanisms  have  a  hit  ratio  of  0.  In  this  range,  the  ESP  is  superior  by  roughly  a 
factor of two because, as we discussed earlier, addresses need not be broadcast.

In between is a large range of localities where the ESP performs substantially
better than the cache (from a equal to 4 or 5 until a is roughly the number
of blocks in a memory module of the ESP). In this area, most references using the
ESP mechanism are local. On the other hand, with a cache, most references
continue to rely on the system bus. This is because the cache mechanism
retrieves data from memory in very small units, on the order of a few words.
The improvement will be, roughly, the ratio of system bus access times to local
bus times.

Fig. 6: Hit Ratios for ESP and Cache

The programs that will use a MMM, as we argued in Section 1, are memory-intensive
ones, programs that cause a virtual memory system to thrash. Thus we
expect  these  programs  to  operate  in  the  range  of  localities  where  the  ESP 
mechanism  does  pay  off. 

([Garc83] presents more results, and also considers other probability
distributions. The trends obtained are similar to what we have presented here.)

3.3. System and Local Bus Access Times.

The performance improvements of an ESP MMM over a conventional
architecture depend on the value of the system bus access time, D, and the local bus
time, d. If we can implement a system bus with D small compared to the cycle
time of the processor(s), then cutting this time by a factor of two or more may
not be important. Similarly, if d is not significantly lower than D (contrary to
what we have assumed so far), then the gains of the ESP mechanism will be limited.

The  values  of  d  and  D  depend  on  the  hardware  used  to  implement  the 
MMM,  as  well  as  on  the  size  of  the  memory.  Thus,  it  is  difficult  to  reach  any 
definitive  conclusions.  However,  we  can  discuss  two  implementation  scenarios 
where  certainly  D  is  significant  as  compared  to  the  cycle  time,  and  where  d  is 
orders of magnitude less than D. In both of these cases, the ESP MMM performs
very  well. 

• Processor and Memory on a Chip. It will soon be possible to build a
reasonable processor with a few megabytes of memory, all on a single VLSI
chip. These chips will be ideally suited for the construction of an ESP
MMM. The time to access on-chip memory (d) will be very small since
small currents and small distances are involved.

The limiting factor in this implementation will be the rate at which ESPs
can broadcast data out of the chip, into the system bus. However, an optical
bus may provide the necessary throughput.

• Sharing Memory on Existing Computers. Suppose that we already
have an installation with several computers (maybe 2 or 3, maybe 100 or
200) connected via a local area network. The ESP architecture gives us a
way to combine these resources into a single MMM, when it is needed.
Clearly, local memory access times are significantly less than transmission
times over the network, so the ESP is a useful idea. Each existing machine
would be provided with an ESP controller, and the network protocols (for
MMM operation) would be simplified, e.g., there is no contention and no need
for packet headers. (This assumes that while the machines operate as a
MMM, the network has no other users.) A program requiring more memory
than is available at a single machine (even if it only needs the memory of 3
or 4 other machines) can be sped up considerably. There will be improvements
even if its references are totally random, since page faults (with seek,
rotational, and substantial data transfer delays) will be replaced by fast (and
probably short) network messages.

For  some  programs  it  may  be  possible  to  implement  the  ESP  mechanism 
fully  in  software.  If  a  program  has  a  distribution  similar  to  the  one  of  the 
previous  sub-section,  and  if  a  is  less  than  the  memory  at  each  computer, 
then lead changes will be infrequent. A lead change can then be implemented
by sending a message with the state (e.g., contents of registers) of
the  lead  machine  to  the  next  leader. 

4.  CONCLUDING  REMARKS. 

If  we  look  at  the  ratio  of  memory  size  to  processor  speed  of  past  and  present 
commercial  computers,  we  find  that  most  are  within  an  order  of  magnitude  of 
one  megabyte  per  MIPS.  (The  value  one  megabyte  per  MIPS  is  called  “Amdahl’s 
constant”.)  The  supercomputers  being  developed  all  have  ratios  well  below  this 
value,  and  are  targeted  for  computationally  intensive  problems.  The  machine  we 
proposed  here,  on  the  other  hand,  would  have  a  memory  to  speed  ratio  of  100, 
1000 or more. We have argued that such a machine would speed up memory
bound  programs  like  no  other  computer  could.  We  also  asserted  that  a  massive 
memory  machine  having  unconventional  architecture  and  features  would  be  more 
efficient.  Yet,  in  spite  of  its  novel  structure,  this  machine  would  be  simple  to 
program. 

We have only sketched the main features of a massive memory machine and
the  ESP  architecture,  but  of  course,  there  are  many  other  important  issues  that 
must  be  resolved  before  such  a  machine  can  become  a  reality. 

One of these issues is reliability. Fortunately, the ESP architecture appears
to be well suited for failure tolerance. The state of a computation (e.g., registers
and program counter) is replicated at all processors, so if one of them fails, its
state can be reconstructed. The processors are connected through a simple linear
bus, so it is not difficult to have spare units that can take over when others fail.
A major portion of the main memory is not replicated, so if it fails data will be
lost. Thus, if this data is important, a secondary copy must be kept. Mechanisms
similar to those used in database systems (e.g., logging) can be used to keep
the secondary copy up to date. These strategies and mechanisms are discussed
in [Garc83b].

A second issue is the utilization of the multiple processors in the ESP
architecture. Given that they exist, they can also be used for parallel processing. For
example, in a database application, the MMM could be divided up for executing a
parallel search, then reconstituted. Of course, the programs for the multiple
processors will not be as simple as the MMM ones.